Mining source code repositories at massive scale using language modeling

Authors: Miltiadis Allamanis and Charles Sutton
Venue: 2013 10th Working Conference on Mining Software Repositories (MSR)
The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle…
215 Citations
Automatic Builds of Large Software Repositories
  • 1
Function completion in the time of massive data: A code embedding perspective
Combining Code Embedding with Static Analysis for Function-Call Completion.
Semantic Source Code Models Using Identifier Embeddings
  • V. Efstathiou, D. Spinellis
  • Computer Science
  • 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
  • 2019
  • 3
The adverse effects of code duplication in machine learning models of code
  • 57
Linguistic Change in Open Source Software
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
  • 32
Boa Meets Python: A Boa Dataset of Data Science Software in Python Language
  • 6


References
Sourcerer: mining and searching internet-scale software repositories
  • 172
Semantic clustering: Identifying topics in source code
  • 473
A study of the uniqueness of source code
  • 192
Jungloid mining: helping to navigate the API jungle
  • 454
Learning from 6,000 projects: lightweight cross-project anomaly detection
  • 80
On the naturalness of software
  • 253
  • Highly Influential
GHTorrent: Github's data from a firehose
  • 186
Efficient Malicious Code Detection Using N-Gram Analysis and SVM
  • 46