Sampling Projects in GitHub for MSR Studies

@inproceedings{Dabic2021SamplingPI,
  title={Sampling Projects in GitHub for MSR Studies},
  author={Ozren Dabic and Emad Aghajani and Gabriele Bavota},
  booktitle={2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)},
  year={2021},
  pages={560--564}
}
Almost every Mining Software Repositories (MSR) study requires, as a first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria dictated by the study goal. For example, a study related to licensing might be interested in selecting projects explicitly declaring a license. Once the selection criteria have been defined, utilities such as the GitHub APIs can be used to "query" the hosting…
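As a rough illustration of the licensing example in the abstract, the sketch below turns a few selection criteria into a query URL for GitHub's public repository search endpoint. The function name `build_search_url` and its parameters are hypothetical and are not taken from the paper; only the endpoint and the `license:`, `language:`, and `stars:` search qualifiers come from GitHub's documented search syntax. The code only builds the URL and sends no request.

```python
from urllib.parse import urlencode

# GitHub's public REST endpoint for repository search.
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(license_key, language=None, min_stars=None, per_page=100):
    """Build a search URL for repositories that declare `license_key`.

    Hypothetical helper for illustration; the criteria would normally be
    dictated by the goal of the study.
    """
    # Each selection criterion becomes one search qualifier.
    qualifiers = [f"license:{license_key}"]
    if language is not None:
        qualifiers.append(f"language:{language}")
    if min_stars is not None:
        qualifiers.append(f"stars:>={min_stars}")

    # Qualifiers are space-separated inside the single `q` parameter;
    # urlencode takes care of percent-encoding them.
    params = {"q": " ".join(qualifiers), "per_page": per_page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # Example: Java projects declaring an MIT license with at least 100 stars.
    print(build_search_url("mit", language="java", min_stars=100))
```

The resulting URL could then be fetched with any HTTP client, subject to the API's rate limits and result caps.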

Recommending Code Improvements Based on Stack Overflow Answer Edits
TLDR
Matcha, a code recommendation tool that leverages Stack Overflow code snippets with version history and code clone search techniques to identify sub-optimal code in software projects and suggest their optimised version is proposed.
PyMigBench and PyMigTax: A Benchmark and Taxonomy for Python Library Migration
Developers heavily rely on Application Programming Interfaces (APIs) from libraries to build their projects. However, libraries
Do Visual Issue Reports Help Developers Fix Bugs?: - A Preliminary Study of Using Videos and Images to Report Issues on GitHub -
TLDR
This preliminary analysis shows that issue reports with images are described in fewer words than non-visual issue reports, and that most discussions in visual issue reports are concerned with either conditions for reproduction or GUI (e.g., when).
On the Bug-proneness of Structures Inspired by Functional Programming in JavaScript Projects
TLDR
The prevalence of four concepts typically associated with functional programming in JavaScript (recursion, immutability, lazy evaluation, and functions as values) is quantified; the results suggest that functional programming concepts are important for developers using a multi-paradigm language such as JavaScript, and that their usage does not make programs harder to understand.
On the Accuracy of Bot Detection Techniques
TLDR
It is shown that none of the bot detection techniques are accurate enough to detect bots among the 20 most active contributors of each project, and combining these techniques drastically increases the accuracy and recall of bot detection.
SCOTCH: A Semantic Code Search Engine for IDEs
TLDR
Results from automated as well as human evaluation suggest that the inclusion of code context in search significantly improves the retrieval of the correct code snippet but slightly impairs ranking quality among code snippets.
How are Software Repositories Mined? A Systematic Literature Review of Workflows, Methodologies, Reproducibility, and Tools
TLDR
It is found that an important part of the research workflow involving dataset selection was particularly problematic, which raises questions about the generality of the results in existing literature and proposes ways to address these shortcomings via existing tools.
A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts
TLDR
A comparison between Python code written in Jupyter Notebooks and in traditional Python scripts is carried out, and it is demonstrated that notebooks are characterized by lower code complexity; however, their code could be perceived as more entangled than that of scripts.
Lupa: A Framework for Large Scale Analysis of the Programming Language Usage
TLDR
Lupa is a command line tool that uses the power of the IntelliJ Platform under the hood, which gives it access to powerful static analysis tools used in modern IDEs.
Static Analysis Warnings and Automatic Fixing: A Replication for C# Projects
TLDR
This paper investigates to what extent some known results about using static analyzers for Java change when considering C#—another popular object-oriented language and develops and empirically evaluates EagleRepair: a technique to automatically fix code in response to static analysis warnings.

References

Software Heritage: Why and How to Preserve Software Source Code
TLDR
This paper presents Software Heritage, an ambitious initiative to collect, preserve, and share the entire corpus of publicly accessible software source code, and discusses the archival goals, use cases and role as a participant in the broader digital preservation ecosystem, and detail its key design decisions.
The GHTorrent dataset and tool suite
  • Georgios Gousios
  • Computer Science
    2013 10th Working Conference on Mining Software Repositories (MSR)
  • 2013
TLDR
The GHTorrent project has been collecting data for all public projects available on GitHub for more than a year, and the dataset details and construction process are presented.
Got issues? Who cares about it? A large scale investigation of issue trackers from GitHub
TLDR
This study investigates and answers various research questions on the popularity and impact of issue trackers, and performs an empirical study on a hundred thousand open source projects.
A Study on the Interplay between Pull Request Review and Continuous Integration Builds
TLDR
This paper empirically investigates the interplay between pull request discussion and the use of CI by means of 64,865 pull request discussions belonging to 69 open source projects and qualitatively analyzes the content of 857 pull request discussions.
A Tool to Extract Structured Data from GitHub
TLDR
A support tool, named GitRepository, is developed to help create a dataset of repositories based on the proposed schema; the dataset hosts 620 repositories (with basic star and fork filters applied) and 247 repositories (after applying all predefined filters).
Detecting Video Game-Specific Bad Smells in Unity Projects
TLDR
This paper proposes UnityLinter, a static analysis tool that supports Unity video game developers to detect seven types of bad smells the authors have identified as relevant in video game development, which pertain to performance, maintainability and incorrect behavior problems.
An Empirical Study of Method Chaining in Java
TLDR
Whether method chaining is a programming style accepted by real-world programmers is investigated, and language features that are helpful to the method-chaining style but have not been supported yet in Java are explored.
On the Prevalence, Impact, and Evolution of SQL Code Smells in Data-Intensive Systems
TLDR
The results show that SQL code smells are indeed prevalent and persistent in the studied data-intensive software systems and have a weaker association with bugs than that of traditional code smells.
An Empirical Study of Quick Remedy Commits
TLDR
A qualitative study investigating "quick remedy commits" performed by developers with the goal of implementing changes omitted in previous commits is presented, and a taxonomy categorizing the types of changes that developers tend to omit is defined.
Developer-Driven Code Smell Prioritization
TLDR
This paper proposes an approach based on machine learning able to rank code smells according to the perceived criticality that developers assign to them and performs a first step toward the concept of developer-driven code smell prioritization.