On the Nature of Code Cloning in Open-Source Java Projects

Yaroslav Golubev and Timofey Bryksin. 2021 IEEE 15th International Workshop on Software Clones (IWSC).

Code cloning plays a very important role in open-source software engineering. The presence of clones within a project may indicate a need for refactoring, and clones between projects are even more interesting, since code migration takes place and violations are possible. But how is code being copied? How prevalent is the process and on what level does it happen? In this general study, we attempt to shed some light on these questions by searching for clones in a large dataset of over 23 thousand… 

Survey on Software code clone detection

Explains what software clones are, their kinds, and the methods used to detect them, and lists research in this field up to January 2022.



File cloning in open source Java projects: The good, the bad, and the ugly

A novel method of file-level code clone detection that scales to millions of files is developed, and the most commonly cloned files are found to be Java extension classes and popular third-party libraries, both large and small.

Near-miss function clones in open source software: an empirical study

An in-depth study of near-miss function clones in open source software using NICAD, which examines more than 20 open source C, Java and C# systems of varying kinds and sizes, including the entire Linux Kernel.

A Study of Potential Code Borrowing and License Violations in Java Projects on GitHub

An extensive corpus of popular Java projects from GitHub is compiled, an original analysis of possible code borrowing and license violations on the level of code fragments is performed, and it is discovered that 29.6% of blocks of code might be involved in potential code borrowing and 9.4% could potentially violate original licenses.

A Survey on Software Clone Detection Research

The state of the art in clone detection research is surveyed, the clone terms commonly used in the literature are described along with their corresponding mappings to the commonly used clone types, and several open problems related to clone detection research are pointed out.

Effects of cloned code on software maintainability: A replicated developer study

An extended replication of a controlled experiment that analyzes the effects of cloned bugs on the program comprehension of programmers showed that programmers performed significantly better when given clone information than without clone information.

Multi-threshold token-based code clone detection

A modification to bag-of-tokens based clone detection that allows detecting more clone pairs of greater diversity without losing precision by implementing a multi-threshold search, i.e. conducting the search several times, aimed at different groups of clones.
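
The idea summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: the tokenizer, the similarity formula (shared tokens divided by the size of the larger bag), the function names, and the `(min_tokens, threshold)` passes are all assumptions chosen for illustration. It only shows what "conducting the search several times" with different thresholds could look like.

```python
import re
from collections import Counter
from itertools import combinations


def bag_of_tokens(code):
    """Count identifier and number tokens, ignoring their order (assumed tokenizer)."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+", code))


def similarity(a, b):
    """Shared tokens divided by the size of the larger bag (assumed measure)."""
    shared = sum((a & b).values())  # Counter '&' keeps the minimum count per token
    larger = max(sum(a.values()), sum(b.values()))
    return shared / larger if larger else 0.0


def multi_threshold_search(fragments, passes):
    """Run one search per (min_tokens, threshold) pass and merge the clone pairs.

    fragments: dict mapping a fragment name to its source text.
    passes: list of (min_tokens, threshold) tuples; the values are hypothetical.
    """
    bags = {name: bag_of_tokens(src) for name, src in fragments.items()}
    found = set()
    for min_tokens, threshold in passes:
        # Each pass only considers fragments large enough for its group.
        eligible = [n for n, bag in bags.items() if sum(bag.values()) >= min_tokens]
        for x, y in combinations(sorted(eligible), 2):
            if similarity(bags[x], bags[y]) >= threshold:
                found.add((x, y))
    return found


fragments = {
    "a": "int sum = a + b; return sum;",
    "b": "int total = a + b; return total;",
    "c": "print(x)",
}
strict_only = multi_threshold_search(fragments, [(5, 0.9)])
strict_and_loose = multi_threshold_search(fragments, [(5, 0.9), (5, 0.6)])
```

In this toy input, a single strict pass finds no pairs, while adding a looser pass picks up the renamed-variable pair, which is the kind of extra diversity the summary refers to. How the real tool keeps precision while loosening thresholds is not covered by this sketch.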

Measuring the Efficacy of Code Clone Information in a Bug Localization Task: An Empirical Study

The results of this study showed that participants who first identified a defect then used it to look for clones of the defect were more effective than participants who used the clone information before finding any defects.

SourcererCC: Scaling Code Clone Detection to Big-Code

This paper presents a token-based clone detector, SourcererCC, that can detect both exact and near-miss clones from large inter-project repositories using a standard workstation, and evaluates the scalability, execution time, recall and precision, and compares it to four publicly available and state-of-the-art tools.

Cross-project code clones in GitHub

An in-depth empirical study of cloning in GitHub, and a novel tool named CLONE-HUNTRESS that streamlines finding and tracking code clones in GitHub; the tool is GitHub-integrated, built around a user-friendly interface, and runs efficiently over a modern database system.

Large-Scale Code Reuse in Open Source Software

A. Mockus. First International Workshop on Emerging Trends in FLOSS Research and Development (FLOSS'07: ICSE Workshops 2007), 2007.

The authors' findings indicate that more than 50% of the files were used in more than one project and the most widely reused components were small and represented templates requiring major and minor modifications and a group of files reused without any change.