GHTorrent: Github's data from a firehose

@article{Gousios2012GHTorrentGD,
  title={GHTorrent: Github's data from a firehose},
  author={Georgios Gousios and Diomidis D. Spinellis},
  journal={2012 9th IEEE Working Conference on Mining Software Repositories (MSR)},
  year={2012},
  pages={12--21}
}
A common requirement of many empirical software engineering studies is the acquisition and curation of data from software repositories. During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive REST API, which enables researchers to retrieve both the commits to the projects' repositories and events generated through user actions on project resources. GHTorrent aims to create a scalable offline mirror of GitHub…
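As a rough illustration of the two kinds of data the abstract mentions (commits and events), the following Python sketch builds requests against GitHub's public REST endpoints for repository commits (`/repos/{owner}/{repo}/commits`) and public events (`/events`). It is not GHTorrent's implementation: the helper names (`commits_url`, `events_url`, `fetch_json`, `count_event_types`) and the `User-Agent` string are hypothetical, and the requests are unauthenticated, so they are subject to GitHub's rate limits.

```python
import json
from collections import Counter
from urllib.request import Request, urlopen

API_ROOT = "https://api.github.com"


def commits_url(owner: str, repo: str) -> str:
    """URL of the endpoint that lists the commits of one repository."""
    return f"{API_ROOT}/repos/{owner}/{repo}/commits"


def events_url() -> str:
    """URL of the endpoint that lists recent public events."""
    return f"{API_ROOT}/events"


def fetch_json(url: str, timeout: float = 10.0):
    """GET a URL and decode its JSON body (hypothetical helper, no auth)."""
    request = Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "ghtorrent-sketch",  # assumed placeholder value
        },
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


def count_event_types(events) -> Counter:
    """Tally events by their "type" field (for example PushEvent)."""
    return Counter(event.get("type", "unknown") for event in events)


if __name__ == "__main__":
    # Performs a real network request only when run as a script.
    print(count_event_types(fetch_json(events_url())))
```

Keeping URL construction and event tallying separate from the network call means the logic can be checked offline against a small hand-written sample of events.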


Lean GHTorrent: GitHub data on demand
TLDR
A novel feature of GHTorrent designed to offer customisable data dumps on demand is presented, which aims to lower the "barrier for entry" even further for researchers interested in mining GitHub data and enhance the replicability of GitHub studies.
Mining Software Engineering Data from GitHub
TLDR
This tutorial analyzes how data from GitHub can be used for large-scale, quantitative research while avoiding common pitfalls; it uses the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples and present pitfall avoidance strategies.
The promises and perils of mining GitHub
TLDR
It is shown, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were.
A Tool to Extract Structured Data from GitHub
TLDR
A support tool named GitRepository is developed to help create a dataset of repositories based on the proposed schema; the dataset hosts 620 repositories after basic filters on stars and forks, and 247 repositories after all pre-defined filters.
A Dataset of Duplicate Pull-Requests in GitHub
TLDR
A large dataset of historical duplicate PRs extracted from 26 popular open source projects in GitHub is constructed using a semi-automatic approach, to facilitate further studies that better understand and solve the issues introduced by duplicate PRs.
Gitana: A SQL-Based Git Repository Inspector
TLDR
A conceptual schema for Git is proposed, together with an approach that, given a Git repository, exports its data to a relational database in order to promote data integration with other existing SCM tools and enable writing queries on Git data using standard SQL syntax.
The Promises and Perils of Mining GitHub (Extended Version)
TLDR
The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration, and a set of recommendations for software engineering researchers on how to approach the data in GitHub is provided.
An in-depth study of the promises and perils of mining GitHub
TLDR
The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration; the study also provides a set of recommendations for software engineering researchers on how to approach the data in GitHub.
Boa Views: Easy Modularization and Sharing of MSR Analyses
TLDR
The notion of views from the relational database field is used to design a query language and runtime infrastructure extension for Boa that provides output reuse to Boa users and increases sharing and reuse of MSR queries.
The GHTorrent dataset and tool suite
  • Georgios Gousios
  • Computer Science
  • 2013 10th Working Conference on Mining Software Repositories (MSR)
  • 2013
TLDR
The GHTorrent project has been collecting data for all public projects available on GitHub for more than a year, and the dataset details and construction process are presented.

References

The promises and perils of mining git
TLDR
This work focuses on git, a very popular DSCM used in high-profile projects and aims to help researchers interested in DSCMs avoid perils when mining and analyzing git data.
Developer identification methods for integrated data from various sources
TLDR
This paper proposes an approach, based on the application of heuristics, to identify the many identities of developers in such cases, and a data structure for allowing both the anonymized distribution of information, and the tracking of identities for verification purposes.
Global software development in the FreeBSD project
FreeBSD is a sophisticated operating system developed and maintained as open-source software by a team of more than 350 individuals located throughout the world. This study uses developer location
Cohesive and Isolated Development with Branches
TLDR
DVC branching enables natural collaborative processes: it allows developers to collaborate on tasks in highly cohesive branches while enjoying reduced interference from developers working on other tasks, even if those tasks are strongly coupled to theirs.
FLOSSMetrics: Free/Libre/Open Source Software Metrics
TLDR
The main objective of FLOSSMETRICS is to construct, publish and analyse a large-scale database with information and metrics about libre software development coming from several thousands of software projects, using existing methodologies and tools already developed.
SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects
TLDR
The goal in building SourcererDB is to provide a rich dataset of source code to facilitate the sharing of extracted data and to encourage reuse and repeatability of experiments.
Building Knowledge in Open Source Software Research in Six Years of Conferences
TLDR
This work mines articles of the OSS conference series to understand the process of knowledge grounding and the community surrounding it and proposes a semi-automated approach for a systematic mapping study on these articles.
MongoDB: The Definitive Guide
TLDR
This authoritative introduction to MongoDB teaches the many advantages of using document-oriented databases and shows why MongoDB is a reliable, high-performance system that allows for almost infinite horizontal scalability.