The GHTorrent dataset and tool suite

@inproceedings{Gousios2013TheGD,
  title={The {GHTorrent} dataset and tool suite},
  author={Georgios Gousios},
  booktitle={2013 10th Working Conference on Mining Software Repositories (MSR)},
  year={2013},
  pages={233--236}
}
  • Georgios Gousios
  • Published 18 May 2013
  • Computer Science
  • 2013 10th Working Conference on Mining Software Repositories (MSR)
During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive REST API, which enables researchers to retrieve high-quality, interconnected data. The GHTorrent project has been collecting data for all public projects available on GitHub for more than a year. In this paper, we present the dataset details and construction process and outline the challenges and research opportunities emerging from it.

A Tool to Extract Structured Data from GitHub
TLDR
A tool named GitRepository is developed to help create a dataset of repositories based on the proposed schema; the dataset hosts 620 repositories after basic star and fork filters and 247 repositories after all predefined filters are applied.
Three trillion lines: infrastructure for mining GitHub in the classroom
TLDR
There is a need for domain-specific tools, especially databases, that can deal with large-scale code repositories and their associated metadata, and open challenges remain in using them more effectively for research and machine learning.
GitEvolve: Predicting the Evolution of GitHub Repositories
TLDR
This work proposes GitEvolve, a system to predict the evolution of GitHub repositories and the different ways by which users interact with them, and develops an end-to-end multi-task sequential deep neural network that simultaneously predicts which user-group is next going to interact with a given repository.
A Dataset for GitHub Repository Deduplication: Extended Description.
TLDR
This work provides a dataset of 10.6 million GitHub projects that are copies of others, and links each record with the project's ultimate parent, and identifies 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects.
SemanGit: A Linked Dataset from git
TLDR
This article presents the dataset, describes the extraction process according to the ontology, shows some promising analyses of the data and outlines how SemanGit could be linked with external datasets or enriched with new sources to allow for more complex analyses.
A Dataset for GitHub Repository Deduplication
TLDR
This work provides a dataset of 10.6 million GitHub projects that are copies of others, and links each record with the project's ultimate parent, and identifies 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects.
GHTraffic: A Dataset for Reproducible Research in Service-Oriented Computing
TLDR
Use cases for a dataset of significant size, comprising HTTP transactions extracted from GitHub data and augmented with synthetic transaction data, are discussed, and a set of requirements is extracted from these use cases.
Data collection and analysis of GitHub repositories and users
TLDR
The collection and mining of GitHub data are presented, aiming to understand GitHub user behavior and project success factors, and seven success rules for GitHub projects are derived.
An in-depth study of the promises and perils of mining GitHub
TLDR
The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration, and provides a set of recommendations for software engineering researchers on how to approach the data in GitHub.
A dataset for pull-based development research
TLDR
A dataset of almost 900 projects and 350,000 pull requests, including some of the largest users of pull requests on GitHub, is constructed, and a machine learning tool set for the R statistics environment is presented.

References

GHTorrent: Github's data from a firehose
TLDR
GHTorrent aims to create a scalable offline mirror of GitHub's event streams and persistent data, and offer it to the research community as a service.
An exploration of the pull-based software development model
  • Mar. 2013. Submitted to the ACM symposium on the Foundations of Software Engineering 2013.
  • 2013
The Github archive
  • Mar. 2012. Online, accessed Feb 2013.
  • 2012