• Corpus ID: 227335233

A Tool to Extract Structured Data from GitHub

Shreyans R. Surana, Smit Detroja, Saurabh Tiwari
GitHub repositories contain detailed information about a project: its contributors, the number of commits and their authors, releases, pull requests, programming languages, and issues. However, no systematic dataset of open source projects exists that features detailed information about the repositories on GitHub for knowledge acquisition and mining. In this paper, we developed tool support, named GitRepository, which helps in creating a dataset of repositories based on the…

Sampling Projects in GitHub for MSR Studies
GHS (GitHub Search) is a dataset containing 25 characteristics of 735,669 repositories written in 10 programming languages. The characteristics were derived from project selection criteria frequently used in MSR studies, and the dataset is continuously updated to provide fresh data about existing projects and to increase the number of indexed projects.


GHTorrent: Github's data from a firehose
GHTorrent aims to create a scalable offline mirror of GitHub's event streams and persistent data, and to offer it to the research community as a service.
Summarising Big Data: Common GitHub Dataset for Software Engineering Challenges
A common dataset was created and shared with researchers to support work on many software engineering problems and to ease the otherwise difficult comparison of accuracy across studies.
Boa: A language and infrastructure for analyzing ultra-large-scale software repositories
Boa, a domain-specific language and infrastructure, aims to ease the testing of MSR-related hypotheses; the authors implement Boa and provide a web-based interface to its infrastructure.
The GHTorrent dataset and tool suite
  • Georgios Gousios
  • Computer Science
    2013 10th Working Conference on Mining Software Repositories (MSR)
  • 2013
The GHTorrent project has been collecting data for all public projects available on GitHub for more than a year; this paper presents the dataset details and construction process.
RapidRelease - A Dataset of Projects and Issues on Github with Rapid Releases
This paper introduces the RapidRelease dataset, a data showcase of open-source projects with high release frequency, believed to be the first dataset that can help researchers empirically study release engineering and agile software development in open-source projects with rapid releases.
Developer onboarding in GitHub: the role of prior social links and language experience
This work explores the GitHub evidence for socialization as a precursor to joining a project, and finds that the presence of past social connections combined with prior experience in languages dominant in the project leads to higher productivity both initially and cumulatively.
Recommending GitHub Projects for Developer Onboarding
Experimental results show that NNLRank can provide effective and efficient onboarding recommendation to developers, substantially outperforming the previous models.