• Corpus ID: 49311510

The Many Shapes of Archive-It

Shawn M. Jones, Alexander C. Nwala, Michele C. Weigle, Michael L. Nelson
Web archives, a key area of digital preservation, meet the needs of journalists, social scientists, historians, and government organizations. The use cases for these groups often require that they guide the archiving process themselves, selecting their own original resources, or seeds, and creating their own web archive collections. We focus on the collections within Archive-It, a subscription service started by the Internet Archive in 2005 for the purpose of allowing organizations to create… 


Improving Collection Understanding in Web Archives

Focusing on Archive-It, this work seeks to identify the different types of web archive collections, the algorithms that can be used to summarize those collections, and the best visualizations of those summaries to support better collection understanding.

Creating Structure in Web Archives with Collections: Different Concepts from Web Archivists

This work reviews the collection structures of eight web archive platforms: Archive-It, Conifer, the Croatian Web Archive, the Internet Archive’s user account web archives, Library of Congress, PANDORA, Trove, and the UK Web Archive.

Bootstrapping Web Archive Collections From Micro-Collections in Social Media

This work introduces a novel source for generating seeds from URIs in the threaded conversations of social media posts created by single or multiple users, and presents the Micro-collection/Quality Proxy (MCQP) framework for bootstrapping Web archive collections from Micro-collections in social media.

Hypercane: toolkit for summarizing large collections of archived webpages

This work presents Hypercane, the tool in the DSA suite responsible for selecting exemplar pages, and eight action statements that can be combined in various ways to customize the sample that is produced, which can be used to analyze large web archive collections outside of the DSA suite.

Sowing the Seeds for More Usable Web Archives: A Usability Study of Archive-It

What users expect from web archives is investigated, with several key areas of improvement for the Archive-It service pertaining to metadata options, terminology display, indexing of dates, and the site's search box identified.

Social Cards Probably Provide For Better Understanding Of Web Archive Collections

It is found that social cards and social cards paired side-by-side with browser thumbnails probably provide better collection understanding than the surrogates currently used by the popular Archive-It web archiving platform.

MementoEmbed and Raintale for Web Archive Storytelling

MementoEmbed is created to generate cards for individual mementos and Raintale for creating entire stories that archivists can export to a variety of formats.

SHARI - An Integration of Tools to Visualize the Story of the Day

This paper describes how to combine several existing tools with web archive holdings to perform news analysis and visualization of the "biggest story" for a given date and names this process SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration).

MementoMap Framework for Flexible and Adaptive Web Archive Profiling

A single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a small one and an in-file binary search method for efficient lookup are designed.

A Bibliographic Scan of Digital Scholarly Communication Infrastructure

Saunders, H. (2021). Mapping Scholarly Communication Infrastructure: A Bibliographic Scan of Digital Scholarly Communication Infrastructure [Book Review]. Journal of Librarianship and Scholarly



Generating Stories From Archived Collections

The Dark and Stormy Archive (DSA) framework is proposed, and the stories it automatically generates are found to be indistinguishable from those created by human subject domain experts, while both kinds of stories are easily distinguished from randomly generated stories.

Detecting off-topic pages within TimeMaps in Web archives

This paper addresses the problem of detecting when a particular page in a Web archive collection has gone off-topic relative to its first archived copy, using different methods (cosine similarity, Jaccard similarity, intersection of the 20 most frequent terms, Web-based kernel function, and the change in size using the number of words and content length).
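
As a rough illustration of three of the measures named in this summary, the sketch below compares each archived copy's text against the first copy using cosine similarity, Jaccard similarity, and overlap of the 20 most frequent terms. It is not the paper's implementation: the function names, the simple tokenizer, the raw term-frequency weighting, and the 0.15 threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def top_term_overlap(a, b, k=20):
    """Fraction of a's k most frequent terms that are also among b's k most frequent."""
    top_a = {t for t, _ in Counter(tokenize(a)).most_common(k)}
    top_b = {t for t, _ in Counter(tokenize(b)).most_common(k)}
    return len(top_a & top_b) / len(top_a) if top_a else 0.0


def off_topic_flags(memento_texts, threshold=0.15, measure=cosine_similarity):
    """Flag each copy whose similarity to the first copy is below the threshold.

    The threshold value is a placeholder assumption, not a tuned or published value.
    """
    if not memento_texts:
        return []
    first = memento_texts[0]
    return [measure(first, text) < threshold for text in memento_texts]


if __name__ == "__main__":
    # Hypothetical page texts, ordered from earliest to latest capture.
    texts = [
        "hurricane relief shelter news",
        "hurricane relief news update",
        "domain for sale buy now",
    ]
    print(off_topic_flags(texts))
```

In this toy input the last text shares no terms with the first copy, so it is the only one flagged. Any real use would need a threshold chosen against labeled data.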

Content selection and curation for web archiving: The gatekeepers vs. the masses

This work recommends a hybrid approach that combines an effort driven by social media and more traditional curatorial methods that can archive pages that are contained in social media streams such as Twitter.

Web archiving in a Web 2.0 world

E. Crook, Computer Science, Electron. Libr., 2009

The current state of web archiving in Australia is discussed, along with how libraries are adapting their services in recognition of the expanding role that online material plays in their collections.

Profiling web archive coverage for top-level domain and content language

This work profiles fifteen public web archives using data from a variety of sources (the web, archives’ access logs, and fulltext queries to archives) and uses these profiles as resource descriptors in matching the URI-lookup requests to the most probable web archives.

Observing Web Archives: The Case for an Ethnographic Study of Web Archiving

The concept of web archival labour is proposed to encompass and highlight the ways in which web archivists shape and maintain the preserved Web through work that is often embedded in and obscured by the complex technical arrangements of collection and access.

Characteristics of social media stories

To support automatic story creation, this work examines, as a baseline, the structural characteristics of popular (i.e., receiving the most views) human-generated stories, which differ from the resources in Archive-It collections.

Functionalities of Web Archives

A functionality checklist was designed, based on use cases created by the International Internet Preservation Consortium (IIPC) and the findings of two related user studies, and a comprehensive literature review of web archiving methods was conducted.

The Future of Artist Files: Here Today, Gone Tomorrow

Historically, art librarians saved ephemera—exhibition checklists, gallery announcements, biographical information, and reproductions of works of art—in collections known as artist files.

Scraping SERPs for Archival Seeds: It Matters When You Start

The findings suggest that, due to the difficulty in retrieving the URIs of news stories from Google, collection building that originates from search engines should begin as soon as possible in order to capture the first stages of events, and should persist in order to capture the evolution of the events.