Capturing SlideShare Presentations

This demo serves no other purpose than to illustrate the technical capabilities of the Memento Tracer Framework for creating high-quality captures of web publications. SlideShare has been selected because it illustrates the ability of the framework to capture resources that would be very difficult, if not impossible, to capture using regular web crawling approaches.

Step 1 - A Curator Records a Trace Using the Memento Tracer Browser Extension

The below screencast shows the process of creating a Trace with the Memento Tracer browser extension, using the SlideShare presentation https://www.slideshare.net/hvdsomp/paul-evan-peters-lecture as the example. The extension intercepts all interactions (e.g., clicks) of the curator with this SlideShare presentation and records them as a Trace, serialized in JSON. In this Trace, each interaction is expressed such that the entity that is subject to interaction is uniquely and abstractly identified. That is, the interaction is not tied to the specific presentation on which the Trace is recorded. As a result, the recorded Trace will be re-usable in Step 3 to automatically interact with other SlideShare presentations, in order to generate quality captures.

Legend for the above screencast:

The curator navigates to the SlideShare presentation https://www.slideshare.net/hvdsomp/paul-evan-peters-lecture and activates the Memento Tracer browser extension.
The curator wants captures of every slide in a SlideShare presentation. The SlideShare interface provides a way to navigate slides in a presentation by clicking on the next and previous buttons. The browser extension provides a way to capture these repeated actions so that they can work on any other SlideShare presentation with an arbitrary number of slides:
- The curator activates the extension's "Automate Repeated Button Clicks" option.
- Next, the curator chooses the exit condition that determines when the repeated clicks must stop. In a SlideShare presentation, when the next button is clicked after the last slide has loaded, SlideShare automatically loads another presentation, and the URL of the resource changes to the URL of that new presentation. Hence, the curator selects "Navigated to a New Resource" under the "Repeat Clicks Until" section.
- The curator now clicks the next button once and this interaction is recorded accordingly.
- The curator deactivates the extension's "Automate Repeated Button Clicks" option.
The curator copies the recorded interactions, expressed as JSON, from the extension, pastes them into a file, and saves the file as slideshare.json.

Step 2 - A Curator Uploads the Trace to a Shared Repository

Once a Trace is successfully recorded, the curator uploads it to a shared community repository. This repository does not yet exist but its design is being investigated. When it goes live, it will contain the below Trace, resulting from Step 1. For now, the Trace is being passed on manually to the operator in charge of Step 3. In this Trace, note portal_url_match, whose value is the URL pattern (expressed as a regular expression) to which this Trace applies.


  {
    "portal_url_match": "(slideshare.net)\/([^\/]+)\/([^\/]+)",
    "actions": [
    {
      "action_order": "1",
      "value": "div.j-next-btn.arrow-right",
      "type": "CSSSelector",
      "action": "repeated_click",
      "repeat_until": {
        "condition": "changes",
        "type": "resource_url"
      }
    }],
    "resource_url": "https://www.slideshare.net/hvdsomp/paul-evan-peters-lecture",
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3417.0 Safari/537.36"
  }
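The role of portal_url_match can be illustrated with a minimal Python sketch that loads an abbreviated copy of the above Trace and tests candidate URLs against the pattern. The function name trace_applies is hypothetical, and the sketch assumes the pattern is searched anywhere in the URL; the framework's actual matching logic is not described here and may differ.

```python
import json
import re

# Abbreviated copy of the Trace above; only the fields used here are kept.
TRACE_JSON = r'''
{
  "portal_url_match": "(slideshare.net)\/([^\/]+)\/([^\/]+)",
  "actions": [
    {
      "action_order": "1",
      "value": "div.j-next-btn.arrow-right",
      "type": "CSSSelector",
      "action": "repeated_click",
      "repeat_until": {"condition": "changes", "type": "resource_url"}
    }
  ]
}
'''


def trace_applies(trace: dict, url: str) -> bool:
    """Return True if the Trace's portal_url_match pattern occurs in url.

    Assumption: the pattern is searched anywhere in the URL rather than
    anchored at its start; the source does not say which is used.
    """
    return re.search(trace["portal_url_match"], url) is not None


# JSON's "\/" escape decodes to a plain "/", leaving an ordinary regex.
trace = json.loads(TRACE_JSON)

for candidate in (
    "https://pt.slideshare.net/elfpavlik/api-standardization-work-in-w3c-groups/",
    "https://example.org/some/page",
):
    print(candidate, trace_applies(trace, candidate))
```

Because the pattern only requires slideshare.net followed by two path segments, it also covers language-specific hosts such as pt.slideshare.net.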

Step 3 - An Operator of a Headless Browser Set-Up Uses Traces from the Shared Repository

In order to generate captures, the Memento Tracer framework currently uses a set-up consisting of:

Selenium WebDriver, which automates the actions of a Google Chrome headless browser.
The WarcProxy capturing tool that writes resources navigated by the headless browser to a WARC file.
Storm-Crawler to manage the crawling process.

The current Tracer Framework set-up consists of two instances of the Storm-Crawler running simultaneously. The primary Storm-Crawler instance performs the sequence of interactions necessary for a successful Trace, and offloads the capturing of the out-links in a page to the secondary Storm-Crawler instance. Both the primary and the secondary Storm-Crawlers use the same WarcProxy and Selenium WebDriver for capturing.
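One common way to route a Selenium-driven Chrome through a recording proxy is to start the browser with the appropriate command-line switches. The following Python sketch only assembles such switches; the function name chrome_arguments, the host, and the port are assumptions made for illustration and do not describe the framework's actual configuration.

```python
from typing import List


def chrome_arguments(proxy_host: str, proxy_port: int,
                     user_agent: str, headless: bool = True) -> List[str]:
    """Build Chrome command-line switches that send traffic via a proxy.

    --headless, --proxy-server and --user-agent are standard Chrome
    switches; the values passed in are placeholders for this sketch.
    """
    args = [
        f"--proxy-server=http://{proxy_host}:{proxy_port}",
        f"--user-agent={user_agent}",
    ]
    if headless:
        args.insert(0, "--headless")
    return args


# Hypothetical proxy address; the real set-up's address is not documented here.
args = chrome_arguments("localhost", 8000, "ExampleAgent/1.0")
print(args)
```

With Selenium, each of these strings would typically be passed to ChromeOptions.add_argument before the driver is created. Capturing HTTPS traffic usually also requires the browser to trust the proxy's certificate, which is not shown.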

The below screencast shows the major steps involved in the capture process.

Legend for the above screencast:

The WarcProxy is started and waits for incoming requests.
The Google Chrome web browser is started and runs in the foreground, listening for a signal from the Selenium WebDriver regarding URLs to load. All requests made by the headless browser pass through the WarcProxy, which records those requests and their responses in a WARC file.
The Storm-Crawler is started and provided with the URL of the SlideShare presentation https://pt.slideshare.net/elfpavlik/api-standardization-work-in-w3c-groups/ that must be crawled. Note that this is a different SlideShare presentation than the one used to create the Trace in Step 1.
Given this URL, Storm-Crawler checks whether it has a Trace with a URL pattern (see portal_url_match above) that matches the URL at hand. In this case, the crawler matches the URL to the Trace shown in Step 2 and starts executing its sequence of interactions on https://pt.slideshare.net/elfpavlik/api-standardization-work-in-w3c-groups/. The many URLs that are being crawled are shown in the log messages output by the Storm-Crawler.
Once the crawler has captured all resources, it is stopped. The WarcProxy is also stopped, and the resulting WARC file, which contains the SlideShare presentation https://pt.slideshare.net/elfpavlik/api-standardization-work-in-w3c-groups/ captured according to the above Trace, is saved.
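The way a repeated_click action with a "resource_url changes" exit condition might be replayed can be sketched in Python as follows. In the real set-up the clicks are performed by Selenium WebDriver; here the Page interface, the FakeSlideDeck stand-in, the example.org URLs, and the max_clicks safety bound are all hypothetical and exist only to make the loop runnable on its own.

```python
from typing import Protocol


class Page(Protocol):
    """Minimal browser interface assumed by this sketch (hypothetical)."""

    current_url: str

    def click(self, css_selector: str) -> None: ...


def replay_repeated_click(page: Page, action: dict, max_clicks: int = 500) -> int:
    """Click the selected element until the resource URL changes.

    Returns the number of clicks performed. max_clicks is a safety bound
    added for this sketch; it is not part of the Trace format.
    """
    until = action["repeat_until"]
    if (until["type"], until["condition"]) != ("resource_url", "changes"):
        raise ValueError("only the 'resource_url changes' condition is sketched")
    start_url = page.current_url
    for clicks in range(1, max_clicks + 1):
        page.click(action["value"])
        if page.current_url != start_url:
            return clicks  # navigated to a new resource: stop clicking
    return max_clicks


class FakeSlideDeck:
    """Stand-in for a browser showing a deck with a fixed number of slides."""

    def __init__(self, url: str, slides: int, next_url: str) -> None:
        self.current_url = url
        self.slide = 1
        self._slides = slides
        self._next_url = next_url

    def click(self, css_selector: str) -> None:
        if self.slide < self._slides:
            self.slide += 1  # still inside the same presentation
        else:
            self.current_url = self._next_url  # clicked past the last slide


action = {
    "value": "div.j-next-btn.arrow-right",
    "type": "CSSSelector",
    "action": "repeated_click",
    "repeat_until": {"condition": "changes", "type": "resource_url"},
}
deck = FakeSlideDeck("https://example.org/deck-a", slides=3,
                     next_url="https://example.org/deck-b")
clicks = replay_repeated_click(deck, action)
print(clicks, deck.current_url)
```

Since the loop stops on a URL change rather than on a fixed count, the same action works for presentations with any number of slides, which is what makes the Trace re-usable.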

Result

The below screencast shows the WARC file that results from Step 3 being opened in the webrecorder.io Webrecorder Player. All resources that were captured are listed in the opening screen once the WARC has been loaded. The URL of the captured SlideShare presentation is clicked. Further interactions follow, which illustrate that, by using the Trace, all resources that the curator wanted to capture for a SlideShare presentation have indeed been captured. Resources that were not selected by the curator were not captured.

The WARC file is available for download.

Compare with the capture of the same presentation using the Internet Archive's Save Page Now feature around the same time the above WARC file was created.

Last update: May 23 2018