A Large-scale Dataset of (Open Source) License Text Variants

  • Stefano Zacchiroli
  • Published 1 April 2022
  • Computer Science
  • 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR)
We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive (the largest publicly available archive of FOSS source code with accompanying development history) all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open…
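The collection step described in the abstract (select files by name, keep unique texts) can be illustrated with a minimal sketch. The filename pattern, the function names, and the use of SHA-1 for deduplication are all assumptions made for illustration; they are not taken from the paper and may differ from what the dataset actually uses.

```python
import hashlib
import re

# Hypothetical pattern for filenames that commonly carry licensing terms.
# This is an illustrative assumption, not the pattern used for the dataset.
LICENSE_NAME_RE = re.compile(
    r"^(copying|copyright|notice|licen[cs]e)(\.[a-z0-9._-]+)?$",
    re.IGNORECASE,
)


def looks_like_license_file(filename: str) -> bool:
    """Return True if the bare filename matches the hypothetical pattern."""
    return LICENSE_NAME_RE.match(filename) is not None


def unique_license_texts(files):
    """Keep one copy of each distinct license text.

    `files` is an iterable of (filename, content_bytes) pairs.
    Returns a dict mapping a content digest to the file content.
    """
    unique = {}
    for name, content in files:
        if not looks_like_license_file(name):
            continue
        # Deduplicate on a content hash (SHA-1 chosen here as an assumption).
        digest = hashlib.sha1(content).hexdigest()
        unique.setdefault(digest, content)
    return unique
```

Called on a list of `(filename, content)` pairs, `unique_license_texts` skips files whose names do not match and collapses byte-identical texts into a single entry, which is one plausible reading of "unique license files".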

Machine Learning-Based Detection of Open Source License Exceptions

This work performs a large-scale empirical study on the change history of over 51K FOSS systems, aimed at quantitatively investigating the prevalence of known license exceptions and identifying new ones, and evaluates license exception classification with four different supervised learners and a sensitivity analysis.

The Debsources Dataset: two decades of free and open source software

The Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), getting a grasp of major FOSS trends of the past two decades.

The Software Heritage Graph Dataset: Public Software Development Under One Roof

The Software Heritage graph dataset is introduced: a fully-deduplicated Merkle DAG representation of the Software Heritage archive that links together file content identifiers, source code directories, Version Control System commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls.

License usage and changes: a large-scale study on GitHub

Results of the study trigger the need for better tool support in guiding developers in choosing/changing licenses and in keeping track of the rationale of license changes, and highlight a lack of traceability of when and why licensing changes are made.

World of Code: Enabling a Research Workflow for Mining and Analyzing the Universe of Open Source VCS data

World of Code (WoC) is a very large and frequently updated collection of version control data covering the entire FLOSS ecosystem. It can completely cross-reference authors, projects, commits, blobs, dependencies, and history of the FLOSS ecosystem, and provides capabilities to efficiently correct, augment, query, and analyze that data.

Software provenance tracking at the scale of public source code

Using the properties of isochrone subgraphs, the growth rate of original, i.e. never-seen-before, source code files and commits is quantified and found to be exponential over a period of more than 40 years.

Ultra-Large-Scale Repository Analysis via Graph Compression

The problem of mining the development history, as captured by modern version control systems, of ultra-large-scale software archives (e.g., tens of millions of software repositories) is considered, and graph compression techniques are applied, dramatically reducing the hardware resources needed to mine similarly sized corpora.

The Software Heritage Filesystem (SwhFS): Integrating Source Code Archival with Development

SwhFS provides a POSIX filesystem view of Software Heritage, the largest public archive of software source code and version control system (VCS) development history, so that archived artifacts can be accessed using common programming tools and custom scripts, as if they were locally available.

Evolutional analysis of licenses in FOSS

This paper analyzes licenses through the evolution of FreeBSD, OpenBSD, Eclipse, and ArgoUML, using the license analysis tool Ninka, and discusses characteristics of the evolution of the licenses used in those systems.

Software Heritage: Why and How to Preserve Software Source Code

This paper presents Software Heritage, an ambitious initiative to collect, preserve, and share the entire corpus of publicly accessible software source code. It discusses the initiative's archival goals, use cases, and role as a participant in the broader digital preservation ecosystem, and details its key design decisions.