A Tool to Extract Structured Data from GitHub
@article{Surana2020ATT,
  title={A Tool to Extract Structured Data from GitHub},
  author={Shreyans R. Surana and Smit Detroja and Saurabh Tiwari},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.03453}
}
GitHub repositories contain detailed information about a project, including its contributors, the number of commits, releases, pull requests, programming languages, and issues. However, no systematic dataset of open-source projects exists that features detailed information about GitHub repositories for knowledge acquisition and mining. In this paper, we developed a tool, named GitRepository, which helps in creating a dataset of repositories based on the…
One Citation
Sampling Projects in GitHub for MSR Studies
- Computer Science
- 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)
- 2021
Presents GHS (GitHub Search), a dataset containing 25 characteristics of 735,669 repositories written in 10 programming languages. The characteristics were derived from project selection criteria frequently used in MSR studies, and the dataset is continuously updated to provide fresh data about existing projects and to increase the number of indexed projects.
References
GHTorrent: Github's data from a firehose
- Computer Science
- 2012 9th IEEE Working Conference on Mining Software Repositories (MSR)
- 2012
GHTorrent aims to create a scalable offline mirror of GitHub's event streams and persistent data, and to offer it to the research community as a service.
Summarising Big Data: Common GitHub Dataset for Software Engineering Challenges
- Computer Science
- ArXiv
- 2020
A common dataset was created and shared with researchers so that they can work on many software engineering problems using the same data; without such a dataset, comparing the accuracy of studies is quite difficult.
Boa: A language and infrastructure for analyzing ultra-large-scale software repositories
- Computer Science
- 2013 35th International Conference on Software Engineering (ICSE)
- 2013
The goal of Boa, a domain-specific language and infrastructure, is to ease the testing of MSR-related hypotheses. The authors implement Boa and provide a web-based interface to its infrastructure.
The GHTorent dataset and tool suite
- Computer Science
- 2013 10th Working Conference on Mining Software Repositories (MSR)
- 2013
The GHTorrent project has been collecting data for all public projects available on GitHub for more than a year; the paper presents the dataset details and its construction process.
RapidRelease - A Dataset of Projects and Issues on Github with Rapid Releases
- Computer Science
- 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
- 2019
This paper introduces the RapidRelease dataset, a data showcase of open-source projects with high release frequency. It is believed to be the first dataset that helps researchers empirically study release engineering and agile software development in open-source projects with rapid releases.
Developer onboarding in GitHub: the role of prior social links and language experience
- Economics
- ESEC/SIGSOFT FSE
- 2015
This work explores the GitHub evidence for socialization as a precursor to joining a project, and finds that the presence of past social connections combined with prior experience in languages dominant in the project leads to higher productivity both initially and cumulatively.
Recommending GitHub Projects for Developer Onboarding
- Computer Science
- IEEE Access
- 2018
Experimental results show that NNLRank can provide effective and efficient onboarding recommendation to developers, substantially outperforming the previous models.