The promises and perils of mining git

@article{Bird2009ThePA,
  title={The promises and perils of mining git},
  author={Christian Bird and Peter C. Rigby and Earl T. Barr and David J. Hamilton and Daniel M. Germ{\'a}n and Premkumar T. Devanbu},
  journal={2009 6th IEEE International Working Conference on Mining Software Repositories},
  year={2009},
  pages={1--10}
}
We are now witnessing the rapid growth of decentralized source code management (DSCM) systems, in which every developer has her own repository. DSCMs facilitate a style of collaboration in which work output can flow sideways (and privately) between collaborators, rather than always up and down (and publicly) via a central repository. Decentralization comes with both the promise of new data and the peril of its misinterpretation. We focus on git, a very popular DSCM used in high-profile projects… 
Continuously mining distributed version control systems: an empirical study of how Linux uses Git
TLDR
A method that continuously mines all known D-VCSs of a software project to uncover the complete development history of a project is presented and the characteristics of the ecosystem of git repositories of the Linux kernel are investigated.
The promises and perils of mining GitHub
TLDR
It is shown, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were.
Refining code ownership with synchronous changes
TLDR
This work argues that the usage of mainstream SCM (software configuration management) systems influences the way that developers work, and integrates into its ownership measurement a model of memory retention, to simulate the effect of memory loss over time.
Gitana: A SQL-Based Git Repository Inspector
TLDR
A conceptual schema for Git is proposed, along with an approach that, given a Git repository, exports its data to a relational database in order to promote data integration with other existing SCM tools and enable writing queries on Git data using standard SQL syntax.
The Promises and Perils of Mining GitHub (Extended Version)
TLDR
The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration, and a set of recommendations for software engineering researchers on how to approach the data in GitHub is provided.
Mining File Histories: Should We Consider Branches?
TLDR
This study analyzes over 1,400 Git repositories of four open source ecosystems and computes modification histories for over two million files, using two different algorithms, and finds that considering full file histories leads to an increase in the techniques' performance that is rather modest.
Detection of Named Branch Origin for Git Commits
TLDR
A heuristics-based algorithm is presented that detects the named branch origin of commits based on merge commit messages; it shows an enormous increase in recall when compared to the only existing algorithm for branch name detection.
Recovering Commit Branch of Origin from GitHub Repositories
TLDR
An approach to automatically recover the name of the branch where a given commit was originally made within a GitHub repository is presented and evaluated; the average accuracy exceeds 97% across all commits and the average precision exceeds 80%.
A Dataset of the Activity of the Git Super-repository of Linux in 2012
TLDR
This dataset documents the activity in the public portion of the git Super-repository of the Linux kernel during 2012, including the repository of Linus Torvalds, to help understand how kernel contributors use git, how they collaborate and how commits are integrated into the Linux Kernel and into the repositories of organizations that distribute the kernel.
Cohesive and Isolated Development with Branches
TLDR
DVC branching enables natural collaborative processes: it allows developers to collaborate on tasks in highly cohesive branches, while enjoying reduced interference from developers working on other tasks, even if those tasks are strongly coupled to theirs.

References

SHOWING 1-10 OF 26 REFERENCES
Fine-grained analysis of change couplings
  • B. Fluri, H. Gall, M. Pinzger
  • Computer Science
    Fifth IEEE International Workshop on Source Code Analysis and Manipulation (SCAM'05)
  • 2005
TLDR
This paper presents an approach that uses the structure compare services shipped with the Eclipse IDE to obtain the corresponding fine-grained changes between two subsequent versions of any Java class, distills the causes for change couplings along releases, and filters out those that are structurally relevant.
Detecting Patch Submission and Acceptance in OSS Projects
TLDR
It is argued that the process of patch submission and acceptance into the codebase is an important piece of the open source puzzle and that the use of patch-related data can be helpful in understanding how OSS projects work.
Open Borders? Immigration in Open Source Projects
TLDR
A quantitative case study of the process by which people join FLOSS projects is mounted, using data mined from the Apache Web server, Postgres, and Python to develop a theory of open source project joining, and evaluates this theory based on the data.
Regurgitate: Using GIT for F/LOSS Data Collection
TLDR
A new tool is created, regurgitate, for importing CVS repositories into the GIT source code management system, which is fast enough that it is practical to replay the entire development history of a project commit-at-a-time, collecting metrics at each step.
Hipikat: a project memory for software development
TLDR
This work describes Hipikat, a tool that provides developers with efficient and effective access to the group memory for a software development project that is implicitly formed by all of the artifacts produced during the development.
Predicting faults from cached history
TLDR
In the evaluation of seven open source projects with more than 200,000 revisions, the cache selects 10% of the source code files; these files account for 73%-95% of faults--a significant advance beyond the state of the art.
Mining version histories to guide software changes
TLDR
The ROSE prototype can correctly predict further locations to be changed and show up item coupling that is undetectable by program analysis, and can prevent errors due to incomplete changes.
Populating a Release History Database from version control and bug tracking systems
TLDR
An approach is introduced for populating a release history database that combines version data with bug tracking data and adds missing data not covered by version control systems such as merge points to obtain meaningful views showing the evolution of a software project.
Identifying Changed Source Code Lines from Version Repositories
TLDR
This paper shows how the evolution of changes at source code line level can be inferred from CVS repositories, by combining information retrieval techniques and the Levenshtein edit distance.
Two case studies of open source software development: Apache and Mozilla
TLDR
This work examines data from two major open source projects, the Apache web server and the Mozilla browser, and quantifies aspects of developer participation, core team size, code ownership, productivity, defect density, and problem resolution intervals for these OSS projects.