GHTorrent: Github's data from a firehose
@inproceedings{Gousios2012GHTorrentGD, title={GHTorrent: Github's data from a firehose}, author={Georgios Gousios and Diomidis D. Spinellis}, booktitle={2012 9th IEEE Working Conference on Mining Software Repositories (MSR)}, year={2012}, pages={12--21} }
A common requirement of many empirical software engineering studies is the acquisition and curation of data from software repositories. During the last few years, GitHub has emerged as a popular project hosting, mirroring and collaboration platform. GitHub provides an extensive REST API, which enables researchers to retrieve both the commits to the projects' repositories and events generated through user actions on project resources. GHTorrent aims to create a scalable offline mirror of GitHub…
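As a rough illustration of the event mirroring idea in the abstract, the following minimal Python sketch polls GitHub's public events endpoint and keeps each event only once, keyed by its `id`. It is not GHTorrent's implementation: the function names `fetch_public_events` and `mirror_events` and the in-memory dictionary `store` are hypothetical, chosen here for illustration, and a real mirror would need authentication, rate-limit handling and persistent storage.

```python
import json
import urllib.request

# GitHub's public events endpoint; unauthenticated requests are rate limited.
EVENTS_URL = "https://api.github.com/events"


def fetch_public_events(url=EVENTS_URL):
    """Fetch one page of recent public events as a list of dicts (network call)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def mirror_events(events, store):
    """Add unseen events to `store` (a dict keyed by event id).

    Returns the number of events that were new. `store` stands in for
    a real database and is an assumption of this sketch.
    """
    added = 0
    for event in events:
        event_id = event["id"]
        if event_id not in store:
            store[event_id] = event
            added += 1
    return added
```

Calling `mirror_events(fetch_public_events(), store)` repeatedly would accumulate events without duplicates, since consecutive polls of the endpoint usually overlap.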
225 Citations
Lean GHTorrent: GitHub data on demand
- Computer Science
- MSR 2014
- 2014
A novel feature of GHTorrent designed to offer customisable data dumps on demand is presented, which aims to lower the "barrier for entry" even further for researchers interested in mining GitHub data and enhance the replicability of GitHub studies.
Mining Software Engineering Data from GitHub
- Computer Science
- 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C)
- 2017
This tutorial analyzes how data from GitHub can be used for large-scale, quantitative research while avoiding common pitfalls. It uses the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples and present pitfall avoidance strategies.
The promises and perils of mining GitHub
- Computer Science
- MSR 2014
- 2014
It is shown, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were.
A Tool to Extract Structured Data from GitHub
- Computer Science
- ArXiv
- 2020
A tool named GitRepository is developed to help create a dataset of repositories based on the proposed schema. The dataset hosts 620 repositories (after basic filters on stars and forks) and 247 repositories (after applying all predefined filters).
Gitana: A SQL-Based Git Repository Inspector
- Computer Science
- ER
- 2015
A conceptual schema for Git is proposed, together with an approach that, given a Git repository, exports its data to a relational database in order to promote data integration with other existing SCM tools and enable queries on Git data using standard SQL syntax.
The Promises and Perils of Mining GitHub (Extended Version)
- Computer Science
- 2015
The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration, and a set of recommendations for software engineering researchers on how to approach the data in GitHub is provided.
An in-depth study of the promises and perils of mining GitHub
- Computer Science
- Empirical Software Engineering
- 2015
The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration, and the study provides a set of recommendations for software engineering researchers on how to approach the data in GitHub.
Boa Views: Easy Modularization and Sharing of MSR Analyses
- Computer Science
- MSR
- 2020
The notion of views from the relational database field is used to design a query language and runtime infrastructure extension for Boa that provides output reuse to Boa users and increases sharing and reuse of MSR queries.
The GHTorrent dataset and tool suite
- Computer Science
- 2013 10th Working Conference on Mining Software Repositories (MSR)
- 2013
The GHTorrent project has been collecting data for all public projects available on GitHub for more than a year, and the dataset details and construction process are presented.
A Dataset for GitHub Repository Deduplication
- Computer Science
- MSR
- 2020
This work provides a dataset of 10.6 million GitHub projects that are copies of others, and links each record with the project's ultimate parent, and identifies 30 thousand duplicate projects in an existing popular reference dataset of 1.8 million projects.
References
The promises and perils of mining GitHub
- Computer Science
- MSR 2014
- 2014
It is shown, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were.
The promises and perils of mining git
- Computer Science
- 2009 6th IEEE International Working Conference on Mining Software Repositories
- 2009
This work focuses on git, a very popular DSCM used in high-profile projects and aims to help researchers interested in DSCMs avoid perils when mining and analyzing git data.
Developer identification methods for integrated data from various sources
- Computer Science
- MSR
- 2005
This paper proposes an approach, based on the application of heuristics, to identify the many identities of developers in such cases, and a data structure for allowing both the anonymized distribution of information, and the tracking of identities for verification purposes.
Global software development in the FreeBSD project
- Business
- GSD '06
- 2006
FreeBSD is a sophisticated operating system developed and maintained as open-source software by a team of more than 350 individuals located throughout the world. This study uses developer location…
Cohesive and Isolated Development with Branches
- Computer Science
- FASE
- 2012
DVC branching enables natural collaborative processes: it allows developers to collaborate on tasks in highly cohesive branches, while enjoying reduced interference from developers working on other tasks, even if those tasks are strongly coupled to theirs.
FLOSSMetrics: Free/Libre/Open Source Software Metrics
- Computer Science
- 2009 13th European Conference on Software Maintenance and Reengineering
- 2009
The main objective of FLOSSMETRICS is to construct, publish and analyse a large scale database with information and metrics about libre software development coming from several thousands of software projects, using existing methodologies, and tools already developed.
SourcererDB: An aggregated repository of statically analyzed and cross-linked open source Java projects
- Computer Science
- 2009 6th IEEE International Working Conference on Mining Software Repositories
- 2009
The goal in building SourcererDB is to provide a rich dataset of source code to facilitate the sharing of extracted data and to encourage reuse and repeatability of experiments.
Using Pig as a data preparation language for large-scale mining software repositories studies: An experience report
- Computer Science
- J. Syst. Softw.
- 2012
Building Knowledge in Open Source Software Research in Six Years of Conferences
- Computer Science
- OSS
- 2011
This work mines articles of the OSS conference series to understand the process of knowledge grounding and the community surrounding it and proposes a semi-automated approach for a systematic mapping study on these articles.
MongoDB: The Definitive Guide
- Computer Science
- 2010
With this authoritative introduction to MongoDB, readers will learn the many advantages of using document-oriented databases and discover why MongoDB is a reliable, high-performance system that allows for almost infinite horizontal scalability.