gambit – An Open Source Name Disambiguation Tool for Version Control Systems

@inproceedings{Gote2021gambitA,
  title={gambit – An Open Source Name Disambiguation Tool for Version Control Systems},
  author={Christoph Gote and Christian Zingg},
  booktitle={2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)},
  year={2021},
  pages={80--84}
}
  • Christoph Gote, C. Zingg
  • Published 9 March 2021
  • Computer Science
  • 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)
Name disambiguation is a complex but highly relevant challenge whenever analysing real-world user data, such as data from version control systems. We propose gambit, a rule-based disambiguation tool that only relies on name and email information. We evaluate its performance against two commonly used algorithms with similar characteristics on manually disambiguated ground-truth data from the Gnome GTK project. Our results show that gambit significantly outperforms both algorithms, achieving an… 

Big Data = Big Insights? Operationalising Brooks' Law in a Massive GitHub Data Set
TLDR
This work studies challenges that can explain the disagreement between recent studies of developer productivity in massive repository data and provides the largest, curated corpus of GitHub projects tailored to investigate the influence of team size and collaboration patterns on individual and collective productivity.
Evolving Collaboration, Dependencies, and Use in the Rust Open Source Software Ecosystem
Open-source software (OSS) is widely spread in industry, research, and government. OSS represents an effective development model because it harnesses the decentralized efforts of many developers in a
The penumbra of open source: projects outside of centralized platforms are longer maintained, more academic and more collaborative
TLDR
A novel, extensive sample of public open source project repositories outside of centralized platforms is developed, characterized along a number of dimensions, and compared to a time-matched sample of corresponding GitHub projects.

References

Evaluating author name disambiguation for digital libraries: a case of DBLP
TLDR
DBLP’s author name disambiguation performs well even on large ambiguous name blocks but deficiently on distinguishing authors with the same names, possibly due to its hybrid disambiguation approach combining algorithmic disambiguation and manual error correction.
Seeing the non-stars: (Some) sources of bias in past disambiguation approaches and a new public tool leveraging labeled records
TLDR
The first supervised learning approach for USPTO inventor disambiguation is provided, using random forests and trained on the authors' labeled optoelectronics dataset, and consistently maintains error rates below 3% across all of their available samples.
A comparison of identity merge algorithms for software repositories
ALFAA: Active Learning Fingerprint based Anti-Aliasing for correcting developer identity errors in version control systems
TLDR
The proposed Active Learning Fingerprint Based Anti-Aliasing approach is expected to expedite research progress in the software engineering domain for applications that involve developer identities, and the results indicate that correction of developer identity has a large impact on the inference of the social network.
git2net - Mining Time-Stamped Co-Editing Networks from Large git Repositories
TLDR
git2net is introduced, a scalable Python software that facilitates the extraction of fine-grained co-editing networks in large git repositories, and it is argued that the tool opens up a massive new source of high-resolution data on human collaboration patterns.
Who is Who in the Mailing List? Comparing Six Disambiguation Heuristics to Identify Multiple Addresses of a Participant
TLDR
Six heuristics from the literature are compared using data from 150 mailing lists from Apache Software Foundation projects and it is found that the Oliva et al. and a Naïve heuristic outperformed the others in most cases, when considering the F-measure metric.
Developer identification methods for integrated data from various sources
String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage.
TLDR
A formal method of modeling how to adjust matching weights between pure agreement and pure disagreement is presented and it is demonstrated that the theoretical rules of Fellegi and Sunter are still valid when general weighting adjustments accounting for partial agreement are performed.
Scikit-learn: Machine Learning in Python
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing
Interactive deduplication using active learning
TLDR
This work presents the design of a learning-based deduplication system that uses a novel method of interactively discovering challenging training pairs using active learning and investigates various design issues that arise in building a system to provide interactive response, fast convergence, and interpretable output.