Escaping the Time Pit: Pitfalls and Guidelines for Using Time-Based Git Data

@article{Flint2021EscapingTT,
  title={Escaping the Time Pit: Pitfalls and Guidelines for Using Time-Based Git Data},
  author={Samuel W. Flint and Jigyasa Chauhan and Robert Dyer},
  journal={2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)},
  year={2021},
  pages={85--96}
}
Many software engineering research papers rely on time-based data (e.g., commit timestamps, issue report creation/update/close dates, release dates). Like most real-world data, however, time-based data is often dirty. To date, no studies have quantified how frequently the software engineering research community uses such data, investigated the sources of dirty time-based data, or quantified how often such data is dirty. Depending on the research task and method used, including such dirty data could affect…

How are Software Repositories Mined? A Systematic Literature Review of Workflows, Methodologies, Reproducibility, and Tools
TLDR
It is found that an important part of the research workflow involving dataset selection was particularly problematic, which raises questions about the generality of the results in existing literature; the work proposes ways to address these shortcomings via existing tools.
Cross-Project Online Just-In-Time Software Defect Prediction
TLDR
It is found that training classifiers with incoming CP+WP data can lead to absolute improvements in G-mean, leading to the first investigation of when and to what extent CP data are useful for JIT-SDP in such realistic scenarios.
Query-Analysis on Git-Attributes in Relational and Graph-DB
  • 2022

References

Replicating MSR: A study of the potential replicability of papers published in the Mining Software Repositories proceedings
  • G. Robles
  • Computer Science
    2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010)
  • 2010
TLDR
Results show that MSR authors use in general publicly available data sources, mainly from free software repositories, but that the amount of publicly available processed datasets is very low.
The promises and perils of mining GitHub
TLDR
It is shown, for example, that the majority of the projects are personal and inactive; that GitHub is also being used for free storage and as a Web hosting service; and that almost 40% of all pull requests do not appear as merged, even though they were.
An in-depth study of the promises and perils of mining GitHub
TLDR
The results indicate that while GitHub is a rich source of data on software development, mining GitHub for research purposes should take various potential perils into consideration, and provides a set of recommendations for software engineering researchers on how to approach the data in GitHub.
Mining usage data and development artifacts
TLDR
This work explores how usage data that has been extracted from web server logs can be unified with product release history to study questions that concern both users' detailed dynamic behaviour as well as broad adoption trends across different deployment environments.
Boa: A language and infrastructure for analyzing ultra-large-scale software repositories
TLDR
The goal of Boa, a domain-specific language and infrastructure described here, is to ease testing MSR-related hypotheses; the authors implement Boa and provide a web-based interface to its infrastructure.
The promises and perils of mining git
TLDR
This work focuses on git, a very popular DSCM used in high-profile projects and aims to help researchers interested in DSCMs avoid perils when mining and analyzing git data.
Replicating mining studies with SOFAS
  • Giacomo Ghezzi, H. Gall
  • Computer Science
    2013 10th Working Conference on Mining Software Repositories (MSR)
  • 2013
TLDR
This paper investigates the mining studies of MSR from 2004 to 2011 and finds that from 88 studies published in the MSR proceedings so far, it can fully replicate 25 empirical studies and can replicate 27 additional studies to a large extent.
Impacts of Daylight Saving Time on Software Development
TLDR
The impacts of DST on software development are studied by mining GitHub repositories for the dates when DST-related code was changed and for the regions where the developers who applied the changes live.
The MSR Cookbook: Mining a decade of research
TLDR
A review of all 117 full papers published in the MSR proceedings between 2004 and 2012 extracts 268 comments from these papers, and categorizes them using a grounded theory methodology.
An Empirical Study of Multiple Names and Email Addresses in OSS Version Control Repositories
  • Jiaxin Zhu, Jun Wei
  • Computer Science
    2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
  • 2019
TLDR
The impact analysis shows that the multiple names and email addresses issue cannot be ignored for the basic related measurements, e.g., the number of developers in a repository and the accuracy of related measurements.