GHTorrent: Github's data from a firehose

Georgios Gousios and Diomidis D. Spinellis
2012 9th IEEE Working Conference on Mining Software Repositories (MSR)
A common requirement of many empirical software engineering studies is the acquisition and curation of data from software repositories. During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive REST API, which enables researchers to retrieve both the commits to the projects' repositories and events generated through user actions on project resources. GHTorrent aims to create a scalable offline mirror of GitHub…

Lean GHTorrent: GitHub data on demand

A novel feature of GHTorrent designed to offer customisable data dumps on demand is presented, which aims to lower the "barrier for entry" even further for researchers interested in mining GitHub data and enhance the replicability of GitHub studies.

Mining Software Engineering Data from GitHub

This tutorial analyzes how data from GitHub can be used for large-scale, quantitative research while avoiding common pitfalls. It draws examples from the GHTorrent dataset, a queryable offline mirror of the GitHub API data, and presents pitfall avoidance strategies.

A Tool to Extract Structured Data from GitHub

A tool named GitRepository is developed to help create a dataset of repositories based on the proposed schema. The dataset hosts 620 repositories after basic filters on stars and forks are applied, and 247 repositories after all predefined filters are applied.

A Dataset of Duplicate Pull-Requests in GitHub

A large dataset of historical duplicate PRs, extracted from 26 popular open source projects on GitHub, is constructed using a semi-automatic approach to facilitate further studies that aim to better understand and solve the issues introduced by duplicate PRs.

Gitana: A SQL-Based Git Repository Inspector

A conceptual schema for Git is proposed, together with an approach that, given a Git repository, exports its data to a relational database in order to promote data integration with other existing SCM tools and enable writing queries on Git data using standard SQL syntax.

Efficient GitHub Crawling Using the GraphQL API

This paper presents Prometheus, a system for crawling and storing software repositories from GitHub that follows an event-driven microservice architecture and can significantly outperform alternatives in terms of throughput in some scenarios.

The Promises and Perils of Mining GitHub (Extended Version)

The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration, and a set of recommendations for software engineering researchers on how to approach the data in GitHub is provided.

An in-depth study of the promises and perils of mining GitHub

The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration. The study also provides a set of recommendations for software engineering researchers on how to approach the data in GitHub.

Boa Views: Easy Modularization and Sharing of MSR Analyses

The notion of views from the relational database field is used to design a query language and runtime infrastructure extension for Boa that provides output reuse to Boa users and increases sharing and reuse of MSR queries.

The GHTorent dataset and tool suite

  • Georgios Gousios
  • Computer Science
    2013 10th Working Conference on Mining Software Repositories (MSR)
  • 2013
The GHTorrent project has been collecting data for all public projects available on GitHub for more than a year; the dataset details and construction process are presented.



The promises and perils of mining git

This work focuses on git, a very popular DSCM used in high-profile projects and aims to help researchers interested in DSCMs avoid perils when mining and analyzing git data.

Developer identification methods for integrated data from various sources

This paper proposes an approach, based on the application of heuristics, to identify the many identities of developers in such cases, and a data structure for allowing both the anonymized distribution of information, and the tracking of identities for verification purposes.

Global software development in the FreeBSD project

FreeBSD is a sophisticated operating system developed and maintained as open-source software by a team of more than 350 individuals located throughout the world. This study uses developer location

Cohesive and Isolated Development with Branches

DVC branching enables natural collaborative processes: it allows developers to collaborate on tasks in highly cohesive branches while enjoying reduced interference from developers working on other tasks, even if those tasks are strongly coupled to theirs.

FLOSSMetrics: Free/Libre/Open Source Software Metrics

The main objective of FLOSSMETRICS is to construct, publish and analyse a large-scale database with information and metrics about libre software development coming from several thousands of software projects, using existing methodologies and tools already developed.

SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects

The goal in building SourcererDB is to provide a rich dataset of source code to facilitate the sharing of extracted data and to encourage reuse and repeatability of experiments.

Building Knowledge in Open Source Software Research in Six Years of Conferences

This work mines articles of the OSS conference series to understand the process of knowledge grounding and the community surrounding it and proposes a semi-automated approach for a systematic mapping study on these articles.

Version Control with Git: Powerful Tools and Techniques for Collaborative Software Development

Version Control with Git takes you step-by-step through ways to track, merge, and manage software projects, using this highly flexible, open source version control system. Git permits virtually an

Effort estimation of FLOSS projects: a study of the Linux kernel

The results of this research show that, overall, the effort within the Linux kernel community is constant (albeit at different levels) throughout the week, signalling the need of updated estimation models, different from those used in traditional 9am–5pm, Monday to Friday commercial companies.