The promises and perils of mining GitHub
@inproceedings{Kalliamvakou2014ThePA, title={The promises and perils of mining GitHub}, author={Eirini Kalliamvakou and Georgios Gousios and Kelly Blincoe and Leif Singer and Daniel M. Germ{\'a}n and Daniela E. Damian}, booktitle={MSR 2014}, year={2014} }
With over 10 million git repositories, GitHub is becoming one of the most important sources of software artifacts on the Internet. Researchers are starting to mine the information stored in GitHub's event logs, trying to understand how its users employ the site to collaborate on software. However, so far there have been no studies describing the quality and properties of the data available from GitHub. We document the results of an empirical study aimed at understanding the characteristics of…
654 Citations
An in-depth study of the promises and perils of mining GitHub
- Computer Science
- Empirical Software Engineering
- 2015
The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration; the paper also provides a set of recommendations for software engineering researchers on how to approach the data in GitHub.
The Promises and Perils of Mining GitHub (Extended Version)
- Computer Science
- 2015
The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration, and a set of recommendations for software engineering researchers on how to approach the data in GitHub is provided.
Understanding the Factors That Impact the Popularity of GitHub Repositories
- Computer Science
- 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME)
- 2016
A study on the popularity of software systems hosted at GitHub, which is the world's largest collection of open source software, reveals the main factors that impact the number of stars of GitHub projects, including programming language and application domain.
Detecting similar repositories on GitHub
- Computer Science
- 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER)
- 2017
This paper proposes RepoPal, a novel approach for detecting similar repositories on GitHub based on three heuristics that leverage two data sources (GitHub stars and readme files) not considered in previous work, and compares it to CLAN, a prior state-of-the-art approach.
Curating GitHub for engineered software projects
- Computer Science
- Empirical Software Engineering
- 2017
This work proposes a framework, with a reference implementation as a tool called reaper, that enables researchers to select GitHub repositories containing evidence of an engineered software project; it identifies software engineering practices (called dimensions) and proposes means for validating their existence in a GitHub repository.
A Dataset of the Activity of the Git Super-repository of Linux in 2012
- Computer Science
- 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories
- 2015
This dataset documents the activity in the public portion of the git Super-repository of the Linux kernel during 2012, including the repository of Linus Torvalds, to help understand how kernel contributors use git, how they collaborate and how commits are integrated into the Linux Kernel and into the repositories of organizations that distribute the kernel.
GitQ- Towards Using Badges as Visual Cues for GitHub Projects
- Computer Science
- 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)
- 2022
This work presents GitQ, which automatically augments GitHub repositories with badges representing information about source code and project maintenance, and reports that 11 out of 15 developers perceived GitQ to be useful in identifying the right set of repositories using visual cues such as those generated by GitQ.
Automatically Categorising GitHub Repositories by Application Domain
- Computer Science
- 2022
This work builds on a previously annotated dataset of 5,000 GitHub repositories to design an automated classifier for categorising repositories by their application domain and opens promising avenues for future work investigating differences between repositories from different application domains.
What's in a GitHub Star? Understanding Repository Starring Practices in a Social Coding Platform
- Computer Science
- J. Syst. Softw.
- 2018
"May the fork be with you": novel metrics to analyze collaboration on GitHub
- Computer Science
- WETSoM 2014
- 2014
A set of novel metrics, based on an original classification of commits and conceived to capture some interesting aspects of a multi-repository development process, is presented, and an efficient way to build a data structure for computing these metrics on a set of Git repositories is described.
References
The promises and perils of mining git
- Computer Science
- 2009 6th IEEE International Working Conference on Mining Software Repositories
- 2009
This work focuses on git, a very popular DSCM used in high-profile projects and aims to help researchers interested in DSCMs avoid perils when mining and analyzing git data.
GHTorrent: Github's data from a firehose
- Computer Science
- 2012 9th IEEE Working Conference on Mining Software Repositories (MSR)
- 2012
GHTorrent aims to create a scalable offline mirror of GitHub's event streams and persistent data, and offer it to the research community as a service.
The Perils and Pitfalls of Mining SourceForge
- Computer Science
- MSR
- 2004
Practical lessons gained from spidering, parsing and analysis of SourceForge data are outlined, suggesting which variables are used for screening projects and which for testing hypotheses.
Network Structure of Social Coding in GitHub
- Computer Science
- 2013 17th European Conference on Software Maintenance and Reengineering
- 2013
This paper collects 100,000 projects and 30,000 developers from GitHub, constructs developer-developer and project-project relationship graphs, and computes various characteristics of the graphs, identifying influential developers and projects in this subnetwork of GitHub by using PageRank.
Social Networking Meets Software Development: Perspectives from GitHub, MSDN, Stack Exchange, and TopCoder
- Computer Science
- IEEE Software
- 2013
The guest editors of the January/February 2013 issue conducted semistructured interviews with leaders from four successful companies to gain an understanding of the role social networking plays in today's software development world.
Evaluating the Quality and Quantity of Data on Open Source Software Projects
- Computer Science
- 2005
The number of projects that are active across all of the main indicators of activity accounts for less than 1% of the projects on the portal, which suggests that many OS projects being registered on SourceForge are ‘impulse’ projects, which do not gather sufficient interest from developers or users to ‘activate’ those projects and make them ‘successful’.
Detecting Patch Submission and Acceptance in OSS Projects
- Computer Science
- Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007)
- 2007
It is argued that the process of patch submission and acceptance into the codebase is an important piece of the open source puzzle and that the use of patch-related data can be helpful in understanding how OSS projects work.
A network of Rails: a graph dataset of Ruby on Rails and associated projects
- Computer Science
- 2013 10th Working Conference on Mining Software Repositories (MSR)
- 2013
This dataset provides insight into the relationships between Ruby on Rails and an ecosystem involving 1116 projects and is provided as a graph database suitable for assessing network properties of the community and individuals within those communities.
Open Borders? Immigration in Open Source Projects
- Computer Science
- Fourth International Workshop on Mining Software Repositories (MSR'07:ICSE Workshops 2007)
- 2007
A quantitative case study of the process by which people join FLOSS projects is mounted, using data mined from the Apache Web server, Postgres, and Python to develop a theory of open source project joining and to evaluate this theory based on the data.
Convergent contemporary software peer review practices
- Computer Science
- ESEC/FSE 2013
- 2013
A measure of the degree to which knowledge is shared during review shows that conducting peer review increases the number of distinct files a developer knows about by 66% to 150% depending on the project.