The promises and perils of mining GitHub

Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M. Germán, Daniela E. Damian
Published in MSR 2014

With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of…

An in-depth study of the promises and perils of mining GitHub

The results indicate that while GitHub is a rich source of data on software development, researchers mining it should take various potential perils into consideration. The paper also provides a set of recommendations for software engineering researchers on how to approach the data in GitHub.

The Promises and Perils of Mining GitHub (Extended Version)

The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration, and a set of recommendations for software engineering researchers on how to approach the data in GitHub is provided.

Efficient GitHub Crawling Using the GraphQL API

This paper presents Prometheus, a system for crawling and storing software repositories from GitHub that follows an event-driven microservice architecture and can significantly outperform alternatives in terms of throughput in some scenarios.

Understanding the Factors That Impact the Popularity of GitHub Repositories

A study on the popularity of software systems hosted at GitHub, which is the world's largest collection of open source software, reveals the main factors that impact the number of stars of GitHub projects, including programming language and application domain.

Detecting similar repositories on GitHub

This paper proposes RepoPal, a novel approach that effectively detects similar repositories on GitHub based on three heuristics leveraging two data sources (GitHub stars and readme files) not considered in previous work, and compares it to CLAN, a prior state-of-the-art approach.

Curating GitHub for engineered software projects

This work proposes a framework, and presents a reference implementation of it as a tool called reaper, to enable researchers to select GitHub repositories that contain evidence of an engineered software project. It identifies software engineering practices (called dimensions) and proposes means for validating their existence in a GitHub repository.

A Dataset of the Activity of the Git Super-repository of Linux in 2012

This dataset documents the activity in the public portion of the git Super-repository of the Linux kernel during 2012, including the repository of Linus Torvalds, to help understand how kernel contributors use git, how they collaborate and how commits are integrated into the Linux Kernel and into the repositories of organizations that distribute the kernel.

GitQ - Towards Using Badges as Visual Cues for GitHub Projects

This work presents GitQ, which automatically augments GitHub repositories with badges representing information about source code and project maintenance, and reports that 11 out of 15 developers perceived GitQ to be useful in identifying the right set of repositories using visual cues such as those generated by GitQ.

Automatically Categorising GitHub Repositories by Application Domain

This work builds on a previously annotated dataset of 5,000 GitHub repositories to design an automated classifier for categorising repositories by their application domain and opens promising avenues for future work investigating differences between repositories from different application domains.



The promises and perils of mining git

This work focuses on git, a very popular DSCM used in high-profile projects and aims to help researchers interested in DSCMs avoid perils when mining and analyzing git data.

GHTorrent: Github's data from a firehose

GHTorrent aims to create a scalable offline mirror of GitHub's event streams and persistent data, and offer it to the research community as a service.

The Perils and Pitfalls of Mining SourceForge

Practical lessons gained from spidering, parsing and analysis of SourceForge data are outlined, suggesting which variables are used for screening projects and which for testing hypotheses.

Network Structure of Social Coding in GitHub

This paper collects 100,000 projects and 30,000 developers from GitHub, constructs developer-developer and project-project relationship graphs, and computes various characteristics of the graphs, identifying influential developers and projects in this subnetwork of GitHub by using PageRank.

Social Networking Meets Software Development: Perspectives from GitHub, MSDN, Stack Exchange, and TopCoder

The guest editors of the January/February 2013 issue conducted semistructured interviews with leaders from four successful companies to gain an understanding of the role social networking plays in today's software development world.

Evaluating the Quality and Quantity of Data on Open Source Software Projects

Projects that are active across all of the main indicators of activity account for less than 1% of the projects on the portal, which suggests that many OS projects being registered on SourceForge are ‘impulse’ projects, which do not gather sufficient interest from developers or users to ‘activate’ those projects and make them ‘successful’.

Detecting Patch Submission and Acceptance in OSS Projects

It is argued that the process of patch submission and acceptance into the codebase is an important piece of the open source puzzle and that the use of patch-related data can be helpful in understanding how OSS projects work.

A network of Rails: a graph dataset of Ruby on Rails and associated projects

This dataset provides insight into the relationships between Ruby on Rails and an ecosystem involving 1116 projects and is provided as a graph database suitable for assessing network properties of the community and individuals within those communities.

Open Borders? Immigration in Open Source Projects

A quantitative case study of the process by which people join FLOSS projects is mounted, using data mined from the Apache Web server, Postgres, and Python, to develop a theory of open source project joining and to evaluate this theory based on the data.

Convergent contemporary software peer review practices

A measure of the degree to which knowledge is shared during review shows that conducting peer review increases the number of distinct files a developer knows about by 66% to 150% depending on the project.