Mining source code repositories at massive scale using language modeling
@inproceedings{Allamanis2013MiningSC,
  title={Mining source code repositories at massive scale using language modeling},
  author={Miltiadis Allamanis and Charles Sutton},
  booktitle={2013 10th Working Conference on Mining Software Repositories (MSR)},
  year={2013},
  pages={207--216}
}
The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle…
215 Citations
Function completion in the time of massive data: A code embedding perspective
- Computer Science
- ArXiv
- 2020
Semantic Source Code Models Using Identifier Embeddings
- Computer Science
- 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
- 2019
- 3
The adverse effects of code duplication in machine learning models of code
- Computer Science
- Onward!
- 2019
- 57
Linguistic Change in Open Source Software
- Computer Science
- 2019 IEEE International Conference on Software Maintenance and Evolution (ICSME)
- 2019
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
- Computer Science
- 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE)
- 2020
- 32
Boa Meets Python: A Boa Dataset of Data Science Software in Python Language
- Computer Science
- 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
- 2019
- 6
Estimating defectiveness of source code: A predictive model using GitHub content
- Computer Science
- ArXiv
- 2018
- 6