Mining source code repositories at massive scale using language modeling

@article{Allamanis2013MiningSC,
  title={Mining source code repositories at massive scale using language modeling},
  author={Miltiadis Allamanis and Charles Sutton},
  journal={2013 10th Working Conference on Mining Software Repositories (MSR)},
  year={2013},
  pages={207-216}
}
The tens of thousands of high-quality open source software projects on the Internet raise the exciting possibility of studying software development by finding patterns across truly large source code repositories. This could enable new tools for developing code, encouraging reuse, and navigating large projects. In this paper, we build the first giga-token probabilistic language model of source code, based on 352 million lines of Java. This is 100 times the scale of the pioneering work by Hindle… 
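The paper's core artifact is a count-based n-gram language model over code tokens. As a toy illustration of that idea (at a vastly smaller scale than the giga-token corpus), the sketch below trains an add-one-smoothed trigram model over a crude token stream; the regex lexer, the smoothing scheme, and the example snippet are illustrative assumptions, not the paper's exact setup.

```python
import re
from collections import Counter

def tokenize(code):
    """Very rough lexer: identifiers/keywords, or single punctuation chars."""
    return re.findall(r"[A-Za-z_]\w*|[^\sA-Za-z_]", code)

class TrigramModel:
    """Count-based trigram model with add-one smoothing."""
    def __init__(self):
        self.trigrams = Counter()
        self.bigrams = Counter()
        self.vocab = set()

    def train(self, tokens):
        padded = ["<s>", "<s>"] + tokens
        self.vocab.update(tokens)
        for i in range(len(padded) - 2):
            ctx = (padded[i], padded[i + 1])
            self.bigrams[ctx] += 1
            self.trigrams[ctx + (padded[i + 2],)] += 1

    def prob(self, w1, w2, w3):
        # Add-one (Laplace) smoothing over the observed vocabulary.
        v = len(self.vocab) or 1
        return (self.trigrams[(w1, w2, w3)] + 1) / (self.bigrams[(w1, w2)] + v)

m = TrigramModel()
m.train(tokenize("for ( int i = 0 ; i < n ; i ++ )"))
# After the context "int i", the seen token "=" should score higher
# than an unseen token.
assert m.prob("int", "i", "=") > m.prob("int", "i", "foo")
```

At realistic scale the same counting idea needs smoothing far more careful than add-one, plus a storage layout that fits billions of n-grams, which is where the engineering contribution of giga-token modeling lies.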
Citations

Automatic Builds of Large Software Repositories
TLDR
This work designed several heuristics to maximize the number of projects compiling successfully in a repository, and proposes several general, language-independent heuristics to tackle the most common errors.
Function completion in the time of massive data: A code embedding perspective
TLDR
This work presents a novel approach for improving current function-calls completion tools by learning from independent code repositories, using well-known natural language processing models that can learn vector representation of source code (code embeddings).
Combining Code Embedding with Static Analysis for Function-Call Completion
TLDR
This work presents a novel approach for improving current function-calls completion tools by learning from independent code repositories, using well-known natural language processing models that can learn vector representation of source code (code embeddings).
Semantic Source Code Models Using Identifier Embeddings
  • V. Efstathiou, D. Spinellis
  • 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
TLDR
This work produces pretrained vector space models, distributed code representations for six popular programming languages, namely Java, Python, PHP, C, C++, and C#, each trained on data from a single programming language.
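The embedding work cited here trains proper vector-space models; as a dependency-free stand-in for that idea, the sketch below represents each token by its raw co-occurrence counts within a small window and compares tokens by cosine similarity. The window size and the toy token stream are illustrative assumptions.

```python
import math
from collections import defaultdict

def embed(token_stream, window=2):
    """Toy distributional representation: each token's vector is its
    co-occurrence counts with tokens within +/- `window` positions."""
    vecs = defaultdict(lambda: defaultdict(int))
    for i, tok in enumerate(token_stream):
        lo = max(0, i - window)
        hi = min(len(token_stream), i + window + 1)
        for j in range(lo, hi):
            if i != j:
                vecs[tok][token_stream[j]] += 1
    return vecs

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

toks = "open read close open write close open read close".split()
v = embed(toks)
# "read" and "write" occur in near-identical contexts, so they end up
# closer to each other than to "open".
assert cosine(v["read"], v["write"]) > cosine(v["read"], v["open"])
```

Learned models like word2vec or fastText replace these sparse count vectors with dense, trained ones, but the underlying distributional intuition, that tokens used in similar contexts get similar vectors, is the same.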
The adverse effects of code duplication in machine learning models of code
TLDR
The effects of code duplication on machine learning models are explored, showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora, which more accurately represent how machine learning models of code are used by software engineers.
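The deduplication the summary alludes to can be approximated with a simple token-set similarity filter. The sketch below greedily keeps a file only if its Jaccard similarity to every previously kept file falls below a threshold; the 0.8 threshold, the identifier-only tokenization, and the toy files are illustrative assumptions, not the cited paper's exact method.

```python
import re

def token_set(code):
    """Identifier/keyword tokens only, ignoring punctuation and layout."""
    return set(re.findall(r"[A-Za-z_]\w*", code))

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

def deduplicate(files, threshold=0.8):
    """Greedy near-duplicate filter over (name, source) pairs."""
    kept = []
    for name, code in files:
        toks = token_set(code)
        if all(jaccard(toks, kept_toks) < threshold for _, kept_toks in kept):
            kept.append((name, toks))
    return [name for name, _ in kept]

files = [
    ("A.java", "int add(int a, int b) { return a + b; }"),
    ("B.java", "int add( int a, int b ){ return a+b; }"),  # formatting clone of A.java
    ("C.java", "void log(String msg) { System.out.println(msg); }"),
]
assert deduplicate(files) == ["A.java", "C.java"]
```

Filtering clones like B.java out of the test split is exactly what prevents a model from being scored on code it effectively memorized during training.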
Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code
TLDR
This paper presents an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and shows that such models outperform the state of the art on three distinct code corpora (Java, C, Python).
Linguistic Change in Open Source Software
TLDR
These insights lay out a preliminary foundation for modeling the linguistic history of OSS projects; they will be used to support basic software maintenance and program comprehension activities, and to gain new theoretical insights into the complex interplay between linguistic change and various system and human aspects of OSS development.
Boa Meets Python: A Boa Dataset of Data Science Software in Python Language
TLDR
A new dataset is created that includes 1,558 mature GitHub projects that develop Python software for data science tasks, use a diverse set of machine learning libraries, and are managed by a variety of users and organizations.
DeepClone: Modeling Clones to Generate Code Predictions
TLDR
DeepClone, a novel approach to facilitate code clone reuse, in which a deep learning model is used to model code clones and predict the next possible set of tokens based on the user's input so far, generating code tokens from the learned model.
Estimating defectiveness of source code: A predictive model using GitHub content
TLDR
A method is presented for building a dataset of source code features extracted from open source software source files and their associated bug reports, together with a predictive model for estimating the defectiveness of a given piece of source code.

References

Showing 1-10 of 17 references.
Sourcerer: mining and searching internet-scale software repositories
TLDR
By combining software textual content with structural information captured by the CodeRank approach, this work is able to significantly improve software retrieval performance, increasing the area under the curve (AUC) retrieval metric to 0.92, roughly 10–30% better than previous approaches based on text alone.
A study of the uniqueness of source code
TLDR
The first study of the uniqueness of source code is presented, examining a collection of 6,000 software projects and measuring the degree to which each project can be 'assembled' solely from portions of this corpus, thus providing a precise measure of 'uniqueness' that is called syntactic redundancy.
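The redundancy measure described here can be sketched as the share of a project's token n-grams that already occur somewhere in a reference corpus. The window size (n=3 below, for a tiny example) and the token sequences are illustrative assumptions, not the cited study's parameters.

```python
def ngrams(tokens, n):
    """Set of all contiguous token n-grams in the sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def redundancy(project_tokens, corpus_tokens, n=6):
    """Fraction of the project's token n-grams already present in the
    corpus: a simplified 'syntactic redundancy' score in [0, 1]."""
    proj = ngrams(project_tokens, n)
    corp = ngrams(corpus_tokens, n)
    return len(proj & corp) / len(proj) if proj else 0.0

corpus = "int i = 0 ; i = i + 1 ;".split()
project = "int j = 0 ; j = j + 1 ;".split()
r = redundancy(project, corpus, n=3)
# Only the variable name differs, so some (but not all) trigrams match.
assert 0.0 < r < 1.0
```

This also hints at why the choice of n matters: at small n almost everything is redundant, while at large n only genuinely copied fragments match.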
Jungloid mining: helping to navigate the API jungle
Reuse of existing code from class libraries and frameworks is often difficult because APIs are complex and the client code required to use the APIs can be hard to write. We observed that a common…
Learning from 6,000 projects: lightweight cross-project anomaly detection
TLDR
Using a novel lightweight source code parser, more than 6,000 open source Linux projects are mined to obtain 16,000,000 temporal properties reflecting normal interface usage, and new projects can be checked against these rules to detect anomalies.
GHTorrent: Github's data from a firehose
TLDR
GHTorrent aims to create a scalable offline mirror of GitHub's event streams and persistent data, and offer it to the research community as a service.
Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
It is my distinct pleasure to also welcome you to the Eighteenth ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE-18). We have assembled a high-quality technical…
Efficient Malicious Code Detection Using N-Gram Analysis and SVM
TLDR
This paper proposes an approach that results in effective n-gram feature extraction from malicious code for classifying executables as malicious or benign, with the use of Support Vector Machines (SVM) as the machine learning classifier.
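The pipeline this summary describes, n-gram feature extraction feeding a linear classifier, can be sketched as follows. To keep the sketch dependency-free, a simple perceptron stands in for the SVM (both learn a linear decision boundary), and the byte strings are made-up toy samples, not real malware signatures.

```python
from collections import Counter

def byte_ngrams(data, n=2):
    """Byte n-gram counts: the feature-extraction step the paper pairs
    with an SVM classifier."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

def train_perceptron(samples, epochs=10):
    """Perceptron over sparse n-gram features (stand-in for the SVM).
    `samples` is a list of (feature Counter, label) with label in {+1, -1}."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for feats, label in samples:
            score = bias + sum(weights[f] * c for f, c in feats.items())
            if label * score <= 0:  # misclassified: nudge toward the label
                for f, c in feats.items():
                    weights[f] += label * c
                bias += label
    return weights, bias

def predict(weights, bias, feats):
    score = bias + sum(weights[f] * c for f, c in feats.items())
    return 1 if score > 0 else -1

benign = [b"\x90\x90\xc3", b"\x90\xc3\x90"]
malicious = [b"\xeb\xfe\xcc", b"\xcc\xeb\xfe"]
data = [(byte_ngrams(x), -1) for x in benign] + \
       [(byte_ngrams(x), +1) for x in malicious]
w, b = train_perceptron(data)
assert all(predict(w, b, f) == y for f, y in data)
```

A real deployment would use a margin-maximizing SVM (e.g. a linear SVM over hashed n-gram features) and evaluate on held-out executables rather than the training set.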
Guide to the Software Engineering Body of Knowledge (SWEBOK) and the Software Engineering Education Knowledge (SEEK) - a preliminary mapping
TLDR
The mapping shows that, though there are no major "school of thought" divergences between the two bodies of knowledge, there are a number of differences in the details of each breakdown in terms of vocabulary, level of detail, decomposition approach and topics encompassed.
Cognitive Perspectives on the Role of Naming in Computer Programs
TLDR
Examination of ways in which human cognition is reflected in the text of computer programs focuses on naming: the assignment of identifying labels to programmatic constructs.