Capturing Figshare Repositories

This demo serves only to illustrate the technical capabilities of the Memento Tracer Framework for creating high-quality captures of web publications. Figshare was selected because it illustrates the framework's ability to capture resources that would be very difficult, if not impossible, to capture using regular web crawling approaches.

Step 1 - A Curator Records a Trace Using the Memento Tracer Browser Extension

The below screencast shows the process of creating a Trace, using the Memento Tracer browser extension, on the basis of the Figshare repository https://figshare.com/articles/Beyond_Throughput_a_4G_LTE_Dataset_with_Channel_and_Context_Metrics/6153497. The extension intercepts all interactions (e.g., clicks) of the curator with this Figshare repository and records them as a Trace, serialized in JSON. In this Trace, each interaction is expressed such that the entity that is subject to interaction is uniquely and abstractly identified. That is, the interaction is not tied to this specific repository on which the Trace is recorded. As a result, the recorded Trace will be re-usable in Step 3 to automatically interact with other, similar repositories, in order to generate quality captures.

Legend for the above screencast:

The curator navigates to the Figshare repository https://figshare.com/articles/Beyond_Throughput_a_4G_LTE_Dataset_with_Channel_and_Context_Metrics/6153497 and activates the Memento Tracer browser extension.
The curator wants captures of Figshare repositories to include the repository as a ZIP file. This is achieved as follows:
- In the extension, the curator activates the "Record Clicks" option and then clicks the "Download all" button on the Figshare repository page. The extension records this interaction.
- The curator deactivates the extension's "Record Clicks" option.

The curator wants captures of Figshare repositories to also include the individual files in the repository. This is achieved as follows:
- The curator activates the extension's "Select All Links in an Area" option, which highlights blocks within a page on mouseover, navigates to the area of the repository page that lists the various files/directories as thumbnails, and then clicks inside this area. The extension records this interaction in such a way that it conveys that all resources linked in this area, which exists in all Figshare repositories, need to be captured.
- The curator deactivates the extension's "Select All Links in an Area" option.
The curator would also like to include previews of the data files in the repository. Figshare lets a user preview a data file by clicking its thumbnail, and navigate to the other files in the repository with the next and previous buttons on the preview page. The browser extension can record these repeated actions so that they work on any other Figshare repository with an arbitrary number of data files:
- The curator navigates to the thumbnail preview of one of the data files in the repository and clicks it. This opens a page that renders the data file previews in Figshare.
- The curator then activates the extension's "Automate Repeated Button Clicks" option.
- The curator chooses the exit condition that determines when these repeated clicks must stop. In Figshare, when the preview of the last page is reached, the next button, symbolized by ">", becomes disabled. Hence, the curator selects "Button is Disabled" under the "Repeat Clicks Until" section.
- The curator clicks the next button ">", and the extension records this event in such a way that the next button will be clicked repeatedly in any Figshare repository until it becomes disabled.

The curator copies the recorded interactions, expressed as JSON, from the extension, pastes them to a file, and saves them as figshare.json.

Step 2 - A Curator Uploads the Trace to a Shared Repository

Once a Trace has been successfully recorded, the curator uploads it to a shared community repository. This repository does not yet exist, but its design is being investigated. When it goes live, it will contain the Trace below, which results from Step 1. For now, the Trace is passed on manually to the operator in charge of Step 3. In this Trace, note portal_url_match, whose value is the URL pattern (expressed as a regular expression) to which this Trace applies.


  {  
    "portal_url_match": "(figshare.com)\/articles\/([^\/]+)\/([^\/]+)",   
    "actions": [
      {
        "action": "click",
        "action_order": 1,
        "action_apply": "once",
        "type": "CSSSelector",
        "value": "span.file-size"
      },
      {
        "action": "click",
        "action_order": 2,
        "action_apply": "all",
        "type": "CSSSelector",
        "value": "div.fv-loader a"
      },
      {
        "action": "click",
        "action_order": 4,
        "action_apply": "once",
        "type": "CSSSelector",
        "value": "button.fv-file-view"
      },
      {
        "action": "repeated_click",
        "action_order": 5,
        "repeat_until": {
          "condition": "exists",
          "type": "CSSSelector",
          "value": ["button.fs-next-page[disabled],button.fs-next-page.disabled"]
        },
        "type": "CSSSelector",
        "value": "button.fs-next-page"
      }
    ]           
  }
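
To make the role of portal_url_match concrete, here is a minimal Python sketch that applies the pattern from the Trace above to the two repository URLs used in this demo. The helper name trace_applies is hypothetical, and the use of re.search (matching anywhere in the URL) is an assumption about how the pattern is applied, not a description of the framework's actual implementation.

```python
import json
import re

# Abbreviated copy of the Trace above; only the URL pattern matters here.
# The raw string keeps the JSON "\/" escapes intact for the JSON parser.
TRACE_JSON = r'''
{
  "portal_url_match": "(figshare.com)\/articles\/([^\/]+)\/([^\/]+)",
  "actions": []
}
'''


def trace_applies(trace: dict, url: str) -> bool:
    """Return True if the Trace's URL pattern matches the given URL.

    Hypothetical helper: assumes the pattern may match anywhere in the URL.
    """
    return re.search(trace["portal_url_match"], url) is not None


trace = json.loads(TRACE_JSON)

# Repository on which the Trace was recorded (Step 1).
recorded_on = (
    "https://figshare.com/articles/"
    "Beyond_Throughput_a_4G_LTE_Dataset_with_Channel_and_Context_Metrics/6153497"
)
# A different repository on which the Trace is replayed (Step 3).
replayed_on = (
    "https://figshare.com/articles/"
    "Biomass_flow_and_water_efficiency_of_cactus_pear_under_different_"
    "managements_in_the_Brazilian_Semiarid/6124955"
)

for url in (recorded_on, replayed_on, "https://example.org/some/page"):
    print(trace_applies(trace, url), url)
```

Because the pattern describes the shape of a Figshare article URL rather than one specific article, the same Trace is selected for both repositories.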

Step 3 - An Operator of a Headless Browser Set-Up Uses Traces from the Shared Repository

In order to generate captures, the Memento Tracer framework currently uses a set-up consisting of:

Selenium WebDriver, which automates the actions of a Google Chrome headless browser.
The WarcProxy capturing tool that writes resources navigated by the headless browser to a WARC file.
Storm-Crawler to manage the crawling process.

The current Tracer Framework set-up consists of two instances of the Storm-Crawler running simultaneously. The primary Storm-Crawler instance performs the sequence of interactions necessary for a successful Trace, and offloads the capturing of the out-links in a page to the secondary Storm-Crawler instance. Both the primary and the secondary Storm-Crawlers use the same WarcProxy and Selenium WebDriver for capturing.
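
As a rough illustration of how a Selenium-driven headless Chrome can be pointed at a capturing proxy, consider the following Python sketch. The function names, the proxy address localhost:8000, and the choice to ignore certificate errors (which assumes the proxy intercepts HTTPS with its own certificate) are all assumptions made for this example; the actual Tracer configuration is not shown on this page.

```python
from typing import List


def chrome_arguments(proxy_host: str = "localhost", proxy_port: int = 8000) -> List[str]:
    """Build Chrome command-line flags that route traffic through a proxy.

    Hypothetical helper; the host and port defaults are assumptions.
    """
    return [
        # Run Chrome without a visible window.
        "--headless",
        # Send every request through the capturing proxy.
        f"--proxy-server=http://{proxy_host}:{proxy_port}",
        # Assumption: the proxy presents its own certificate for HTTPS sites.
        "--ignore-certificate-errors",
    ]


def start_browser(proxy_host: str = "localhost", proxy_port: int = 8000):
    """Start Chrome behind the proxy (requires the selenium package and Chrome)."""
    # Imported lazily so that chrome_arguments() has no external dependency.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(proxy_host, proxy_port):
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

With such a configuration, every resource the browser requests while a Trace is replayed passes through the proxy and can be written to a WARC file.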

The below screencast shows the major steps involved in the capture process.

Legend for the above screencast:

The WarcProxy is started and waits for incoming requests.
The Google Chrome web browser is started and runs in the foreground, waiting for the Selenium WebDriver to signal which URLs to load. All requests made by the headless browser pass through the WarcProxy, which records those requests and their responses in a WARC file.
The Storm-Crawler is started and provided with the URL of the Figshare repository that must be crawled, https://figshare.com/articles/Biomass_flow_and_water_efficiency_of_cactus_pear_under_different_managements_in_the_Brazilian_Semiarid/6124955. Note that this is a different Figshare repository than the one used to create the Trace in Step 1.
Given this URL, Storm-Crawler checks whether it has a Trace with a URL pattern (see portal_url_match above) that matches the URL at hand. In this case, the crawler matches the URL to the Trace shown in Step 2 and starts executing its sequence of interactions on https://figshare.com/articles/Biomass_flow_and_water_efficiency_of_cactus_pear_under_different_managements_in_the_Brazilian_Semiarid/6124955. The many URLs that are being crawled are shown in the log messages output by the Storm-Crawler.
Once the crawler has captured all resources, it is stopped. The WarcProxy is also stopped, and the resulting WARC file, which contains the Figshare repository https://figshare.com/articles/Biomass_flow_and_water_efficiency_of_cactus_pear_under_different_managements_in_the_Brazilian_Semiarid/6124955 captured according to the above Trace, is saved.
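
The replay of a Trace described in the legend above can be sketched in Python as follows. This is an illustration only, not the framework's code: the Page interface, the FakePage stand-in, the replay function, and the max_repeats safety limit are all hypothetical, and reading "action_apply": "all" as "apply to every matching element" is an assumption. In the real set-up, a Selenium-driven browser would take the place of FakePage.

```python
from typing import List, Protocol, Tuple


class Page(Protocol):
    """Hypothetical minimal browser interface; not part of Memento Tracer."""

    def exists(self, css: str) -> bool: ...
    def click_first(self, css: str) -> None: ...
    def click_all(self, css: str) -> None: ...


def replay(trace: dict, page: Page, max_repeats: int = 1000) -> None:
    """Execute the Trace's actions in ascending action_order."""
    for action in sorted(trace["actions"], key=lambda a: a["action_order"]):
        selector = action["value"]
        if action["action"] == "click":
            if action.get("action_apply") == "all":
                page.click_all(selector)  # assumption: every matching element
            else:
                page.click_first(selector)
        elif action["action"] == "repeated_click":
            stop_selectors = action["repeat_until"]["value"]
            # Click until a stop selector exists; max_repeats guards against
            # pages where the stop condition never appears.
            for _ in range(max_repeats):
                if any(page.exists(s) for s in stop_selectors):
                    break
                page.click_first(selector)


class FakePage:
    """In-memory stand-in for a browser showing a multi-page preview."""

    def __init__(self, preview_pages: int) -> None:
        self.remaining = preview_pages - 1  # "next" clicks left before disabling
        self.log: List[Tuple[str, str]] = []

    def exists(self, css: str) -> bool:
        # Only the "disabled next button" selector is modelled here.
        return "disabled" in css and self.remaining == 0

    def click_first(self, css: str) -> None:
        self.log.append(("once", css))
        if css == "button.fs-next-page":
            self.remaining -= 1

    def click_all(self, css: str) -> None:
        self.log.append(("all", css))


# The actions of the Trace shown in Step 2.
NEXT_DISABLED = "button.fs-next-page[disabled],button.fs-next-page.disabled"
TRACE = {"actions": [
    {"action": "click", "action_order": 1, "action_apply": "once",
     "type": "CSSSelector", "value": "span.file-size"},
    {"action": "click", "action_order": 2, "action_apply": "all",
     "type": "CSSSelector", "value": "div.fv-loader a"},
    {"action": "click", "action_order": 4, "action_apply": "once",
     "type": "CSSSelector", "value": "button.fv-file-view"},
    {"action": "repeated_click", "action_order": 5,
     "repeat_until": {"condition": "exists", "type": "CSSSelector",
                      "value": [NEXT_DISABLED]},
     "type": "CSSSelector", "value": "button.fs-next-page"},
]}

page = FakePage(preview_pages=3)
replay(TRACE, page)
for entry in page.log:
    print(entry)
```

The loop checks the stop condition before each click, so a repository whose preview has a single page (and therefore a next button that is disabled from the start) results in no clicks on the next button at all.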

Result

The below screencast shows the WARC file that results from Step 3 being opened in the webrecorder.io Webrecorder Player. Once the WARC file has been loaded, all resources that were captured are listed on the opening screen. The URL of the captured Figshare repository is clicked. Further interactions follow, which illustrate that, by using the Trace, all resources that the curator wanted to capture for Figshare repositories have indeed been captured. Resources that were not selected by the curator were not captured.

The WARC file is available for download.

Compare this with the capture of the same repository made with the Internet Archive's Save Page Now feature around the time the above WARC file was created.

Last update: May 23 2018