The impact of JavaScript on archivability

@article{Brunelle2015TheIO,
  title={The impact of JavaScript on archivability},
  author={Justin F. Brunelle and Mat Kelly and Michele C. Weigle and Michael L. Nelson},
  journal={International Journal on Digital Libraries},
  year={2015},
  volume={17},
  pages={95-117}
}
As web technologies evolve, web archivists work to adapt so that digital history is preserved. Recent advances in web technologies have introduced client-side executed scripts (Ajax) that, for example, load data without a change in the top-level Uniform Resource Identifier (URI) or require user interaction (e.g., content loading via Ajax when the page has scrolled). These advances have made it more difficult to automate the capture of web pages. In an effort to understand why mementos…
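The deferred loading the abstract describes can be sketched in a few lines of TypeScript; the endpoint /api/next-page and the #results container are hypothetical, but the pattern is the one that defeats non-JavaScript crawlers: content arrives only after a client-side scroll event, and the top-level URI never changes.

// Sketch of a deferred representation: content is fetched via Ajax on scroll,
// so the top-level URI never changes. A crawler that does not execute
// JavaScript never issues this request, and the memento misses the content.
// The endpoint /api/next-page and the #results container are hypothetical.
let nextPage = 1;

window.addEventListener("scroll", async () => {
  const nearBottom =
    window.innerHeight + window.scrollY >= document.body.offsetHeight - 200;
  if (!nearBottom) return;

  const response = await fetch(`/api/next-page?page=${nextPage}`);
  const html = await response.text();
  document.querySelector("#results")?.insertAdjacentHTML("beforeend", html);
  nextPage += 1;
});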
Archival Crawlers and JavaScript: Discover More Stuff but Crawl More Slowly
TLDR
This work proposes a method of discovering and archiving deferred representations and their descendants (representation states) that are only reachable through client-side events; applied to the July 2015 Common Crawl dataset, the method demonstrates the significant increase in resources necessary for more thorough archival crawls.
Scripts in a frame: A framework for archiving deferred representations
TLDR
Using the approaches detailed in this dissertation, archives can create mementos closer to what users experience rather than archiving the crawlers' experience of the Web, while balancing the performance trade-offs between traditional archival tools and technologies that better archive JavaScript.
The Archival Acid Test: Evaluating archive performance on advanced HTML and JavaScript
TLDR
This paper proposes a set of metrics to evaluate the capability of archival crawlers and other preservation tools using the Acid Test concept, and designs the test to produce a quantitative measure of how well each tool performs its task.
Interoperability for Accessing Versions of Web Resources with the Memento Protocol
TLDR
A variety of tools and services that leverage the broad adoption of the Memento Protocol are described and a selection of research efforts that would likely not have been possible without these interoperability standards are discussed.
Archiving Deferred Representations Using a Two-Tiered Crawling Approach
TLDR
This work uses 10,000 seed Uniform Resource Identifiers (URIs) to explore the impact of including PhantomJS -- a headless browsing tool -- in the crawling process by comparing the performance of wget, PhantomJS, and Heritrix.
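A minimal sketch of that two-tiered idea, assuming Puppeteer as a present-day stand-in for the PhantomJS headless browser named in the paper (and with a deliberately naive resource-extraction regex): fetch the URI once without executing JavaScript and once with a headless browser, then compare which embedded resources each tier discovers.

// Two-tiered crawl sketch: tier 1 is a plain HTTP fetch (wget-like),
// tier 2 uses a headless browser (Puppeteer here, standing in for PhantomJS).
// Resources seen only in tier 2 belong to the deferred representation.
import puppeteer from "puppeteer";

async function tierOne(uri: string): Promise<Set<string>> {
  const html = await (await fetch(uri)).text();
  // Naive extraction of embedded resource URIs from the static markup.
  const matches = html.match(/(?:src|href)="([^"]+)"/g) ?? [];
  return new Set(matches.map((m) => m.slice(m.indexOf('"') + 1, -1)));
}

async function tierTwo(uri: string): Promise<Set<string>> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const requested = new Set<string>();
  page.on("request", (req) => requested.add(req.url()));
  await page.goto(uri, { waitUntil: "networkidle0" });
  await browser.close();
  return requested;
}

async function compare(uri: string): Promise<void> {
  const staticSet = await tierOne(uri);
  const renderedSet = await tierTwo(uri);
  const deferredOnly = [...renderedSet].filter((u) => !staticSet.has(u));
  console.log(`${deferredOnly.length} resources reachable only via JavaScript`);
}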
To Relive the Web: A Framework for the Transformation and Archival Replay of Web Pages
TLDR
This thesis proposes terminology for describing the existing styles of replay and the modifications that web archives make to mementos in order to facilitate replay, and proposes a general framework for the auto-generation of client-side rewriting libraries.
WAIL: Collection-Based Personal Web Archiving
TLDR
This work recreates and extends WAIL from the ground up to facilitate collection-based personal Web archiving, replacing OpenWayback with PyWb, and provides a novel means for personal Web archivists to curate collections of their captures from their own personal computer rather than relying on an external archival Web service.
If these crawls could talk: Studying and documenting web archives provenance
TLDR
The decision space of web archives and its role in shaping what is and is not captured in the web archiving process are examined, and a framework for documenting key dimensions of a collection is proposed that addresses the situated nature of the organizational context, technical specificities, and unique characteristics of the web materials that are the focus of a collection.
A Framework for Verifying the Fixity of Archived Web Resources
TLDR
A framework for establishing and checking fixity on the playback of archived resources, or mementos, is presented; it is built on well-known web archiving standards, such as the Memento protocol.
Leveraging Heritrix and the Wayback Machine on a Corporate Intranet: A Case Study on Improving Corporate Archives
TLDR
The challenges of intranet web archiving are outlined, situations in which the open-source tools are not well suited to the needs of corporate archivists are identified, and recommendations are made for future corporate archivists wishing to use such tools.

References

Showing 1-10 of 85 references
On the Change in Archivability of Websites Over Time
TLDR
It is shown that the archivability of a web page can be deduced from the type of page being archived, which aligns with that page's accessibility with respect to dynamic content.
The Archival Acid Test: Evaluating archive performance on advanced HTML and JavaScript
TLDR
This paper proposes a set of metrics to evaluate the capability of archival crawlers and other preservation tools using the Acid Test concept, and designs the test to produce a quantitative measure of how well each tool performs its task.
Crawling Ajax-Based Web Applications through Dynamic Analysis of User Interface State Changes
TLDR
A novel technique is presented for crawling Ajax-based applications through automatic dynamic analysis of user-interface state changes in Web browsers; it incrementally infers a state machine that models the various navigational paths and states within an Ajax application.
Memento: Time Travel for the Web
TLDR
The Memento solution is a framework in which archived resources can seamlessly be reached via the URI of their original: protocol-based time travel for the Web.
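That protocol-based time travel amounts to datetime negotiation over HTTP (RFC 7089): the client sends an Accept-Datetime header to a TimeGate and is redirected to the memento closest to the requested time. A rough sketch follows, with the Memento aggregator TimeGate URL used only as an illustrative endpoint.

// Memento datetime negotiation sketch (RFC 7089): ask a TimeGate for the
// version of a URI closest to a desired datetime via Accept-Datetime.
// The TimeGate URL is illustrative; any Memento-compliant TimeGate works.
async function findMemento(originalUri: string, when: Date): Promise<string | null> {
  const timegate = `http://timetravel.mementoweb.org/timegate/${originalUri}`;
  const response = await fetch(timegate, {
    headers: { "Accept-Datetime": when.toUTCString() },
    redirect: "follow",
  });
  // After redirection, response.url is the memento URI and the
  // Memento-Datetime header says when it was captured.
  const captured = response.headers.get("Memento-Datetime");
  return captured ? `${response.url} (captured ${captured})` : null;
}

// Example: look for a capture of example.com from mid-2010.
findMemento("http://example.com/", new Date("2010-06-01")).then(console.log);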
Web archiving in a Web 2.0 world
  E. Crook, Electron. Libr., 2009
TLDR
The current state of web archiving in Australia is discussed, along with how libraries are adapting their services in recognition of the expanding role that online material plays in their collections.
How much of the web is archived?
TLDR
The Memento Project's archive access additions to HTTP have enabled development of new web archive access user interfaces, and approximating the Web via sampling URIs from DMOZ, Delicious, Bitly, and search engine indexes and measuring number of archive copies available in various public web archives indicates that 35%-90% of URIs have at least one archived copy.
Gulfstream: Incremental Static Analysis for Streaming JavaScript Applications
TLDR
This paper advocates the use of combined offline-online static analysis as a way to accomplish fast online incremental analysis, at the expense of a more thorough and costly offline analysis of the static code.
AJAXSearch: crawling, indexing and searching web 2.0 applications
TLDR
The demo presents the AJAXSearch engine (crawler, indexer, and query processor) applied to a real application and showcases the challenges and solutions.
Not all mementos are created equal: measuring the impact of missing resources
TLDR
It is shown that Web users’ perceptions of damage are not accurately estimated by the proportion of missing embedded resources, and a damage rating algorithm is proposed that provides closer alignment to Web user perception.
AjaxScope: a platform for remotely monitoring the client-side behavior of web 2.0 applications
TLDR
AjaxScope is a dynamic instrumentation platform that enables cross-user monitoring and just-in-time control of web application behavior on end-user desktops; a variety of policies demonstrating the power of AjaxScope are presented.