The promises and perils of mining GitHub

@inproceedings{Kalliamvakou2014ThePA,
  title={The promises and perils of mining GitHub},
  author={Eirini Kalliamvakou and Georgios Gousios and Kelly Blincoe and Leif Singer and Daniel M. Germ{\'a}n and Daniela E. Damian},
  booktitle={MSR 2014},
  year={2014}
}
With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of…

An in-depth study of the promises and perils of mining GitHub
TLDR
The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration; the paper also provides a set of recommendations for software engineering researchers on how to approach the data in GitHub.
The Promises and Perils of Mining GitHub (Extended Version)
TLDR
The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration, and a set of recommendations for software engineering researchers on how to approach the data in GitHub is provided.
Understanding the Factors That Impact the Popularity of GitHub Repositories
TLDR
A study on the popularity of software systems hosted at GitHub, which is the world's largest collection of open source software, reveals the main factors that impact the number of stars of GitHub projects, including programming language and application domain.
Detecting similar repositories on GitHub
TLDR
This paper proposes RepoPal, a novel approach that can effectively detect similar repositories on GitHub based on three heuristics leveraging two data sources (i.e., GitHub stars and readme files) that were not considered in previous works, and compares it to a prior state-of-the-art approach, CLAN.
Curating GitHub for engineered software projects
TLDR
This work proposes a framework, and presents a reference implementation of the framework as a tool called reaper, to enable researchers to select GitHub repositories that contain evidence of an engineered software project; it also identifies software engineering practices (called dimensions) and proposes means for validating their existence in a GitHub repository.
A Dataset of the Activity of the Git Super-repository of Linux in 2012
TLDR
This dataset documents the activity in the public portion of the git Super-repository of the Linux kernel during 2012, including the repository of Linus Torvalds, to help understand how kernel contributors use git, how they collaborate and how commits are integrated into the Linux Kernel and into the repositories of organizations that distribute the kernel.
GitQ- Towards Using Badges as Visual Cues for GitHub Projects
TLDR
This work presents GitQ, which automatically augments GitHub repositories with badges representing information about source code and project maintenance, and observes that 11 out of 15 developers perceived GitQ to be useful in identifying the right set of repositories using visual cues such as those generated by GitQ.
Automatically Categorising GitHub Repositories by Application Domain
TLDR
This work builds on a previously annotated dataset of 5,000 GitHub repositories to design an automated classifier for categorising repositories by their application domain and opens promising avenues for future work investigating differences between repositories from different application domains.
"May the fork be with you": novel metrics to analyze collaboration on GitHub
TLDR
A set of novel metrics, based on an original classification of commits and conceived to capture some interesting aspects of a multi-repository development process, is presented, and an efficient way to build a data structure that allows these metrics to be computed on a set of Git repositories is described.

References

The promises and perils of mining git
TLDR
This work focuses on git, a very popular DSCM used in high-profile projects and aims to help researchers interested in DSCMs avoid perils when mining and analyzing git data.
GHTorrent: Github's data from a firehose
TLDR
GHTorrent aims to create a scalable off line mirror of GitHub's event streams and persistent data, and offer it to the research community as a service.
The Perils and Pitfalls of Mining SourceForge
TLDR
Practical lessons gained from spidering, parsing and analysis of SourceForge data are outlined, suggesting which variables are used for screening projects and which for testing hypotheses.
Network Structure of Social Coding in GitHub
TLDR
This paper collects 100,000 projects and 30,000 developers from GitHub, constructs developer-developer and project-project relationship graphs, computes various characteristics of the graphs, and identifies influential developers and projects on this subnetwork of GitHub by using PageRank.
Social Networking Meets Software Development: Perspectives from GitHub, MSDN, Stack Exchange, and TopCoder
TLDR
The guest editors of the January/February 2013 issue conducted semistructured interviews with leaders from four successful companies to gain an understanding of the role social networking plays in today's software development world.
Evaluating the Quality and Quantity of Data on Open Source Software Projects
TLDR
Projects that are active across all of the main indicators of activity account for less than 1% of the projects on the portal, which suggests that many OS projects being registered on SourceForge are ‘impulse’ projects, which do not gather sufficient interest from developers or users to ‘activate’ those projects and make them ‘successful’.
Detecting Patch Submission and Acceptance in OSS Projects
TLDR
It is argued that the process of patch submission and acceptance into the codebase is an important piece of the open source puzzle and that the use of patch-related data can be helpful in understanding how OSS projects work.
A network of Rails: a graph dataset of Ruby on Rails and associated projects
TLDR
This dataset provides insight into the relationships between Ruby on Rails and an ecosystem involving 1116 projects and is provided as a graph database suitable for assessing network properties of the community and individuals within those communities.
Open Borders? Immigration in Open Source Projects
TLDR
This work mounts a quantitative case study of the process by which people join FLOSS projects, using data mined from the Apache Web server, Postgres, and Python; it develops a theory of open source project joining and evaluates this theory based on the data.
Convergent contemporary software peer review practices
TLDR
A measure of the degree to which knowledge is shared during review shows that conducting peer review increases the number of distinct files a developer knows about by 66% to 150% depending on the project.