Capturing GitHub Repositories

This demo's sole purpose is to illustrate the technical capabilities of the Memento Tracer Framework for creating high-quality captures of web publications. It was selected because it demonstrates the framework's ability to capture resources that would be very difficult, if not impossible, to capture using regular web crawling approaches.



Step 1 - A Curator Records a Trace Using the Memento Tracer Browser Extension

The below screencast shows the process of creating a Trace, using the Memento Tracer browser extension, on the basis of the GitHub repository https://github.com/gorilla/mux. The extension intercepts all interactions (e.g., clicks) of the curator with this GitHub repository and records them as a Trace, serialized in JSON. In this Trace, each interaction is expressed such that the entity being interacted with is identified uniquely yet abstractly. That is, the interaction is not tied to the specific repository on which the Trace was recorded. As a result, the recorded Trace can be re-used in Step 3 to automatically interact with other, similar repositories in order to generate quality captures.



Legend for the above screencast:
  • The curator navigates to the GitHub repository https://github.com/gorilla/mux and activates the Memento Tracer browser extension.
  • The curator wants captures of GitHub repositories to include the repository as a ZIP file. The GitHub interface supports that, but the download button only becomes available in a popup that opens when another button is clicked first. The extension supports recording these consecutive actions in the appropriate order:
    • In the extension, the curator activates the "Record Clicks" option and then clicks "Clone or Download" in the GitHub repository page. The extension records this interaction.
    • With the extension's "Record Clicks" option still activated, the curator now clicks "Download ZIP" from the popup that resulted from the previous click. The extension records this interaction as well as the fact that it must occur after the previous one.
    • The curator deactivates the extension's "Record Clicks" option.
  • The curator wants captures of GitHub repositories to also include the GitHub rendering of the files that are listed for a repository. This is achieved as follows:
    • The curator activates the extension's "Select All Links in an Area" option, which highlights blocks within a page on mouseover, moves to the area of the repository page that lists the repository's files and directories as hyperlinks, and then clicks inside this area. The extension records this interaction in such a way that it conveys that all resources linked in this area, which exists in all GitHub repositories, need to be captured (see the sketch after this list).
  • The curator copies the recorded interactions, expressed as JSON, from the extension, pastes them into a file, and saves it as github.json.
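
To make the "all links in an area" point concrete, the Python sketch below, using Selenium, resolves a file-listing area selector (the one recorded here appears in the Trace in Step 2) on a different GitHub repository and lists the hyperlinks it contains. Apart from that selector value and the repository URLs, everything in the sketch is an illustrative assumption and not the extension's own code.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Selector recorded for the file-listing area (see the Trace in Step 2).
# It identifies the same area in any GitHub repository, not just gorilla/mux.
AREA_SELECTOR = "table.files.js-navigation-container.js-active-navigation-container a"

driver = webdriver.Chrome()
driver.get("https://github.com/mementoweb/node-solid-server")
for link in driver.find_elements(By.CSS_SELECTOR, AREA_SELECTOR):
    print(link.get_attribute("href"))  # each linked resource is to be captured
driver.quit()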

Step 2 - A Curator Uploads the Trace to a Shared Repository

Once a Trace is successfully recorded, the curator uploads it to a shared community repository. This repository does not yet exist, but its design is being investigated. When it goes live, it will contain the below Trace, which results from Step 1. For now, the Trace is passed on manually to the operator in charge of Step 3. In this Trace, note portal_url_match, whose value is the URL pattern (expressed as a regular expression) to which this Trace applies; a sketch of how such a pattern is matched against URLs follows the Trace.

{
  "portal_url_match": "(github.com)\/([^\/]+)\/([^\/]+)",
  "actions": [{
      "action_order": "1",
      "value": "summary.btn.btn-sm.btn-primary",
      "type": "CSSSelector",
      "action": "click"
    },
    {
      "action_order": "2",
      "value": "id(\"js-repo-pjax-container\")/div[2]/div[1]/div[5]/details[1]/div[1]/div[1]/div[1]/div[2]/a[2]",
      "type": "XPath",
      "action": "click"
    },
    {
      "action_order": "3",
      "value": "table.files.js-navigation-container.js-active-navigation-container a",
      "type": "CSSSelector",
      "action": "click"
    }
  ],
  "action_count": 3,
  "resource_url": "https://github.com/gorilla/mux",
  "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3431.0 Safari/537.36"
}  
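
As a rough illustration of how a crawler could decide whether this Trace applies to a given URL, the Python sketch below matches the portal_url_match pattern against a few candidate URLs. The candidate URLs and the use of Python are assumptions for illustration; this is not the framework's own matching code.

import re

# URL pattern taken from the Trace above; it matches any GitHub repository
# landing page, not only the one on which the Trace was recorded.
portal_url_match = r"(github.com)\/([^\/]+)\/([^\/]+)"

candidates = [
    "https://github.com/gorilla/mux",
    "https://github.com/mementoweb/node-solid-server",
    "https://example.org/some/page",
]
for url in candidates:
    applies = re.search(portal_url_match, url) is not None
    print(url, "->", "Trace applies" if applies else "no matching Trace")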
    

Step 3 - An Operator of a Headless Browser Set-Up Uses Traces from the Shared Repository

In order to generate captures, the Memento Tracer framework currently uses a set-up consisting of:
  • Selenium WebDriver, which automates the actions of a Google Chrome headless browser.
  • The WarcProxy capturing tool, which writes the resources loaded by the headless browser to a WARC file.
  • Storm-Crawler to manage the crawling process.
The current Tracer Framework set-up consists of two instances of the Storm-Crawler running simultaneously. The primary Storm-Crawler instance performs the sequence of interactions needed to successfully execute a Trace and offloads the capturing of a page's out-links to the secondary Storm-Crawler instance. Both the primary and the secondary Storm-Crawler use the same WarcProxy and Selenium WebDriver for capturing.
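
The sketch below shows, in Python with Selenium WebDriver, how a headless Chrome instance can be pointed at a capture proxy so that everything the browser loads is written to a WARC file. The proxy address and port are assumptions for illustration, and the actual framework drives the browser from Storm-Crawler rather than from a stand-alone script.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")                            # no visible browser window
options.add_argument("--proxy-server=http://localhost:8000")  # assumed address of the WarcProxy
options.add_argument("--ignore-certificate-errors")           # the proxy re-signs HTTPS traffic

driver = webdriver.Chrome(options=options)
driver.get("https://github.com/gorilla/mux")  # every request/response passes through the proxy
driver.quit()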
The below screencast shows the major steps involved in the capture process.



Legend for the above screencast:
  • The WarcProxy is started and waits for incoming requests.
  • The Google Chrome web browser is started and runs in the foreground, listening for signals from the Selenium WebDriver about which URLs to load. Because the browser is configured to use the WarcProxy, all of its requests pass through the proxy, which records those requests and their responses in a WARC file.
  • The Storm-Crawler is started and provided with the URL of the GitHub repository https://github.com/mementoweb/node-solid-server that must be crawled. Note that this is a different GitHub repository than the one used to create the Trace in Step 1.
  • Given this URL, the Storm-Crawler checks whether it has a Trace whose URL pattern (see portal_url_match above) matches the URL at hand. In this case, the crawler matches the URL to the Trace shown in Step 2 and starts executing its sequence of interactions on https://github.com/mementoweb/node-solid-server (see the sketch after this list). The many URLs being crawled are shown in log messages output by the Storm-Crawler.
  • Once the crawler has captured all resources, it is stopped. The WarcProxy is also stopped, and the resulting WARC file, which contains the GitHub repository https://github.com/mementoweb/node-solid-server captured according to the above Trace, is saved.
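
To make the replay step concrete, the following Python sketch executes the Trace's actions with Selenium, dispatching on each action's type. The function name and the handling of multi-element actions are illustrative assumptions; in the actual set-up the primary Storm-Crawler instance performs the clicks and hands discovered out-links to the secondary instance.

import json
from selenium.webdriver.common.by import By

# Map the Trace's selector types onto Selenium locator strategies.
BY_TYPE = {"CSSSelector": By.CSS_SELECTOR, "XPath": By.XPATH}

def replay_trace(driver, trace_path):
    # driver is a Selenium WebDriver configured as in the earlier sketch.
    with open(trace_path) as f:
        trace = json.load(f)
    # Execute the recorded interactions in their original order.
    for action in sorted(trace["actions"], key=lambda a: int(a["action_order"])):
        elements = driver.find_elements(BY_TYPE[action["type"]], action["value"])
        if len(elements) == 1:
            elements[0].click()  # e.g. "Clone or Download", then "Download ZIP"
        else:
            # An "all links in an area" action matches many elements; their URLs
            # are out-links whose capture is offloaded to the secondary crawler.
            out_links = [e.get_attribute("href") for e in elements]
            print(f"discovered {len(out_links)} out-links")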

Result

The below screencast shows the WARC file that results from Step 3 being opened in the webrecorder.io Webrecorder Player. All captured resources are listed in the opening screen once the WARC has been loaded. The URL of the captured GitHub repository is clicked. Further interactions follow, illustrating that, by using the Trace, all resources that the curator wanted to capture for GitHub repositories have indeed been captured. Resources that were not selected by the curator (e.g., resources at depth two) were not captured.
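
The same check can be made programmatically; the Python sketch below uses the warcio library to list the URL of every response captured in the WARC file. The file name is an assumption for illustration.

from warcio.archiveiterator import ArchiveIterator

# Print the target URL of every response record in the WARC file.
with open("github-repo.warc.gz", "rb") as stream:  # assumed file name
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            print(record.rec_headers.get_header("WARC-Target-URI"))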

The WARC file is available for download.

Compare this with a capture of the same repository made using the Internet Archive's Save Page Now feature around the same time the above WARC file was created.




Last update: May 23 2018