How different are different diff algorithms in Git?

@article{Nugroho2019HowDA,
  title={How different are different diff algorithms in Git?},
  author={Yusuf Sulistyo Nugroho and Hideaki Hata and Ken-ichi Matsumoto},
  journal={Empirical Software Engineering},
  year={2019},
  volume={25},
  pages={790--823}
}
Automatic identification of the differences between two versions of a file is a common and basic task in several applications of mining code repositories. Git, a version control system, has a diff utility that lets users select a diff algorithm, ranging from the default Myers algorithm to the advanced Histogram algorithm. From our systematic mapping, we identified three popular applications of diff in recent studies. On the impact on code churn metrics in 14 Java projects, we obtained different values…
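As a rough illustration of the setup the abstract describes, the sketch below builds a `git diff` command for a chosen algorithm and counts added and deleted lines in unified-diff output. The `--diff-algorithm` option is a real Git flag; the helper names `build_diff_command` and `count_churn`, and the simple added/deleted definition of churn, are assumptions made for this example and are not taken from the paper.

```python
# Hypothetical helpers for illustration only; not the paper's tooling.

# Algorithm names accepted by `git diff --diff-algorithm=<name>`.
GIT_DIFF_ALGORITHMS = ("myers", "minimal", "patience", "histogram")


def build_diff_command(algorithm, old_rev, new_rev):
    """Return the argument list for a git diff using the given algorithm."""
    if algorithm not in GIT_DIFF_ALGORITHMS:
        raise ValueError(f"unknown diff algorithm: {algorithm}")
    return ["git", "diff", f"--diff-algorithm={algorithm}", old_rev, new_rev]


def count_churn(diff_text):
    """Count (added, deleted) lines inside the hunks of a unified diff.

    Assumed churn definition: every '+' or '-' line within a hunk counts once.
    File headers ('---', '+++') sit before the first hunk and are skipped.
    """
    added = deleted = 0
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # a new file section starts; headers follow
        elif line.startswith("@@"):
            in_hunk = True   # hunk header; change lines follow
        elif in_hunk:
            if line.startswith("+"):
                added += 1
            elif line.startswith("-"):
                deleted += 1
    return added, deleted
```

Passing the list from `build_diff_command` to `subprocess.run` inside a repository, once per algorithm, and feeding each output to `count_churn` would show whether the counts differ for a given pair of revisions.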

Mining Python fix patterns via analyzing fine-grained source code changes

This paper collected bug reports from GitHub, employed the abstract syntax tree edit distance to cluster similar bug-fixing code changes into fix patterns, and evaluated the effectiveness of these fix patterns by applying them to single-hunk bugs in two benchmarks.

Classifying edits to variability in source code

The first complete and unambiguous classification of edits to variability in source code by means of a catalog of edit classes is proposed and validated by classifying each edit in 1.7 million commits in the change histories of 44 open-source software systems automatically.

Evaluating Performance of Clone Detection Tools in Detecting Cloned Cochange Candidates

The findings show that a good clone detector may not perform well in detecting cloned co-change candidates, and they add a new dimension to code clone research.

Just-In-Time Defect Identification and Localization: A Two-Phase Framework

A JIT defect localization approach that leverages software naturalness with the N-gram model is proposed; it achieves reasonable performance and outperforms the two baselines by a substantial margin in terms of four ranking measures.

Learning to Generate Corrective Patches using Neural Machine Translation

This paper proposes Ratchet, a corrective patch generation system using neural machine translation, and shows that Ratchet can generate syntactically valid statements 98.7% of the time, and achieve an F1-measure between 0.41-0.83 with respect to the actual fixes adopted in the code base.

Is Kernel Code Different From Non-Kernel Code? A Case Study of BSD Family Operating Systems

This paper conducts an exploratory study on four BSD family operating systems to characterize code churn in terms of the annual growth rate, commit types, change type ratio, and size taxonomy of commits for different subsystems (kernel, non-kernel, and mixed).

Software evolution: the lifetime of fine-grained elements

A model regarding the lifetime of individual source code lines or tokens can estimate maintenance effort, guide preventive maintenance, and, more broadly, identify factors that can improve the

Does Refactoring Break Tests and to What Extent?

This paper presents a large-scale quantitative study, complemented by a qualitative analysis, involving 615,196 test cases to understand how and to what extent different refactoring operations impact a system's test suites.

Science-Software Linkage: The Challenges of Traceability between Scientific Knowledge and Software Artifacts

The state of the practice of linking research papers and associated source code is summarized, highlighting the recent efforts towards creating and maintaining such links and outlining challenges related to traceability and opportunities for overcoming these challenges.

References

The Uniqueness of Changes: Characteristics and Applications

This paper presents a definition of unique changes, provides a method for identifying them in software project history, explores how prevalent unique changes are, and investigates where they occur along the architecture of the project.

Change Distilling: Tree Differencing for Fine-Grained Source Code Change Extraction

The change distilling algorithm is presented, a tree differencing algorithm for fine-grained source code change extraction that approximates the minimum edit script 45 percent better than the original change extraction approach by Chawathe et al.

Diff/TS: A Tool for Fine-Grained Structural Change Analysis

This paper reports on a tool for fine-grained analysis of structural changes made between revisions of programs, and presents several applications including software "archaeology" on a widely known open source software project and automated "phylogenetic" malware classification based on control flows.

Comparing text‐based and dependence‐based approaches for determining the origins of bugs

Both the text approach and the dependence approach were partially successful across a variety of bugs and suggested the precise definition of program dependence could affect performance, as could whether the approaches identified a single or multiple origins.

Mining Software Repositories for Accurate Authorship

Two new line-level authorship models are presented to overcome the limitation of current tools that assume that the last developer to change a line of code is its author regardless of all earlier changes.

Move-optimized source code tree differencing

  • Georg Dotzler, M. Philippsen
  • Computer Science
    2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE)
  • 2016
Five general optimizations that can be added to state-of-the-art tree differencing algorithms to shorten the resulting edit scripts are presented, along with the novel Move-optimized Tree DIFFerencing algorithm (MTDIFF), which has a higher accuracy in detecting moved code parts.

An Algorithm for Differential File Comparison

The program diff reports differences between two files, expressed as a minimal list of line changes to bring either file into agreement with the other, based on ideas from several sources.

ClDiff: Generating Concise Linked Code Differences

The goal of ClDiff is to generate concise linked code differences whose granularity is in between the existing code differencing and code change summarization methods, to generate more easily understandable code differences.

Evaluating Complexity, Code Churn, and Developer Activity Metrics as Indicators of Software Vulnerabilities

This work investigated whether software metrics obtained from source code and development history are discriminative and predictive of vulnerable code locations, and predicted over 80 percent of the known vulnerable files with less than 25 percent false positives for both projects.

A Framework for Evaluating the Results of the SZZ Approach for Identifying Bug-Introducing Changes

The proposed framework provides a systematic means for evaluating the data that is generated by a given SZZ implementation and finds that current SZZ implementations still lack mechanisms to accurately identify bug-introducing changes.