Sampling Projects in GitHub for MSR Studies

Ozren Dabic, Emad Aghajani, Gabriele Bavota
2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)
Almost every Mining Software Repositories (MSR) study requires, as a first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria dictated by the study goal. For example, a study related to licensing might be interested in selecting projects explicitly declaring a license. Once the selection criteria have been defined, utilities such as the GitHub APIs can be used to "query" the hosting… 
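As a rough illustration of turning selection criteria into a query, the sketch below builds a request URL for GitHub's REST repository search endpoint using its `language:`, `stars:` and `license:` search qualifiers. The helper names (`build_query`, `build_search_url`) and the example criteria are hypothetical and are not taken from the paper; the code only constructs the URL and sends no request.

```python
from urllib.parse import urlencode

# GitHub REST endpoint for repository search.
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_query(language=None, min_stars=None, license_key=None):
    """Combine selection criteria into a GitHub search query string.

    Hypothetical helper: each argument maps to one search qualifier
    and is skipped when left as None.
    """
    qualifiers = []
    if language is not None:
        qualifiers.append(f"language:{language}")
    if min_stars is not None:
        qualifiers.append(f"stars:>={min_stars}")
    if license_key is not None:
        # e.g. "mit" or "apache-2.0" for projects declaring that license
        qualifiers.append(f"license:{license_key}")
    return " ".join(qualifiers)


def build_search_url(query, per_page=100):
    """Return the full search URL for a query (no request is sent)."""
    params = {"q": query, "per_page": per_page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Example criteria (assumed for illustration): MIT-licensed Java
    # projects with at least 100 stars.
    q = build_query(language="java", min_stars=100, license_key="mit")
    print(build_search_url(q))
```

A licensing study like the one mentioned above would vary `license_key`; actually fetching results would additionally involve pagination and rate limits, which this sketch leaves out.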


Recommending Code Improvements Based on Stack Overflow Answer Edits
Matcha, a code recommendation tool that leverages Stack Overflow code snippets with version history and code clone search techniques to identify sub-optimal code in software projects and suggest their optimised version is proposed.
PyMigBench and PyMigTax: A Benchmark and Taxonomy for Python Library Migration
The nature of Python library migrations in open-source systems is investigated, the code changes that happen during library migration are analyzed and a taxonomy of migrations is created, PyMigTax, that categorizes migrations across various dimensions.
An Exploratory Study of Documentation Strategies for Product Features in Popular GitHub Projects
The results suggest that feature documentation in open-source projects is lacking (or given low priority), with little use of normalised structures and rare explicit references to source code.
What Made This Test Flake? Pinpointing Classes Responsible for Test Flakiness
Spectrum-Based Fault Localisation (SBFL), a coverage-based fault localisation technique commonly adopted for its simplicity and effectiveness, is employed, with results showing that localisation methods are effective in major flakiness categories, such as concurrency and asynchronous waits, indicating their general ability to identify flaky components.
Do Visual Issue Reports Help Developers Fix Bugs?: - A Preliminary Study of Using Videos and Images to Report Issues on GitHub -
This preliminary analysis shows that issue reports with images are described in fewer words than non-visual issue reports, and that most discussions in visual issue reports are concerned with either conditions for reproduction or GUI (e.g., when).
On the Bug-proneness of Structures Inspired by Functional Programming in JavaScript Projects
The prevalence of four concepts typically associated with functional programming in JavaScript (recursion, immutability, lazy evaluation, and functions as values) is quantified, suggesting that functional programming concepts are important for developers using a multi-paradigm language such as JavaScript, and that their usage does not make programs harder to understand.
On the Accuracy of Bot Detection Techniques
It is shown that none of the bot detection techniques are accurate enough to detect bots among the 20 most active contributors of each project, and combining these techniques drastically increases the accuracy and recall of bot detection.
Results from automated as well as human evaluation suggest that the inclusion of code context in search significantly improves the retrieval of the correct code snippet but slightly impairs ranking quality among code snippets.
How are Software Repositories Mined? A Systematic Literature Review of Workflows, Methodologies, Reproducibility, and Tools
It is found that an important part of the research workflow involving dataset selection was particularly problematic, which raises questions about the generality of the results in existing literature and proposes ways to address these shortcomings via existing tools.
A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts
A comparison between Python code written in Jupyter Notebooks and in traditional Python scripts is carried out, and it is demonstrated that notebooks are characterized by lower code complexity; however, their code could be perceived as more entangled than that in scripts.


Software Heritage: Why and How to Preserve Software Source Code
This paper presents Software Heritage, an ambitious initiative to collect, preserve, and share the entire corpus of publicly accessible software source code, discusses its archival goals, use cases, and role as a participant in the broader digital preservation ecosystem, and details its key design decisions.
The GHTorrent dataset and tool suite
  • Georgios Gousios
  • Computer Science
    2013 10th Working Conference on Mining Software Repositories (MSR)
  • 2013
The GHTorrent project has been collecting data for all public projects available on GitHub for more than a year, and the dataset details and construction process are presented.
Got issues? Who cares about it? A large scale investigation of issue trackers from GitHub
This study investigates and answers various research questions on the popularity and impact of issue trackers, and performs an empirical study on a hundred thousand open source projects.
A Study on the Interplay between Pull Request Review and Continuous Integration Builds
This paper empirically investigates the interplay between pull request discussion and the use of CI by means of 64,865 pull request discussions belonging to 69 open source projects and qualitatively analyzes the content of 857 pull request discussions.
A Tool to Extract Structured Data from GitHub
A supporting tool, named GitRepository, is developed to help create a dataset of repositories based on the proposed schema; the dataset hosts 620 repositories after applying basic filters on stars and forks, and 247 repositories after applying all pre-defined filters.
Detecting Video Game-Specific Bad Smells in Unity Projects
This paper proposes UnityLinter, a static analysis tool that helps Unity video game developers detect seven types of bad smells the authors have identified as relevant in video game development, which pertain to performance, maintainability, and incorrect behavior problems.
An Empirical Study of Method Chaining in Java
Whether method chaining is a programming style accepted by real-world programmers is investigated, and language features that are helpful to the method-chaining style but have not been supported yet in Java are explored.
On the Prevalence, Impact, and Evolution of SQL Code Smells in Data-Intensive Systems
The results show that SQL code smells are indeed prevalent and persistent in the studied data-intensive software systems and have a weaker association with bugs than that of traditional code smells.
An Empirical Study of Quick Remedy Commits
A qualitative study investigating "quick remedy commits" performed by developers with the goal of implementing changes omitted in previous commits is presented, and a taxonomy categorizing the types of changes that developers tend to omit is defined.
Developer-Driven Code Smell Prioritization
This paper proposes an approach based on machine learning able to rank code smells according to the perceived criticality that developers assign to them and performs a first step toward the concept of developer-driven code smell prioritization.