A Survey of Machine Learning for Big Code and Naturalness

@article{Allamanis2018ASO,
  title={A Survey of Machine Learning for Big Code and Naturalness},
  author={Miltiadis Allamanis and Earl T. Barr and Premkumar T. Devanbu and Charles Sutton},
  journal={ACM Computing Surveys (CSUR)},
  year={2018},
  volume={51},
  pages={1--37}
}
Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code. […] We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges…


ComPy-Learn: A toolbox for exploring machine learning representations for compilers
TLDR
ComPy-Learn is presented, a toolbox for conveniently defining, extracting, and exploring representations of program code, which enables an efficient search for the best-performing representation and model for tasks on program code.
Neural Networks for Modeling Source Code Edits
TLDR
This work treats source code as a dynamic object and tackles the problem of modeling the edits that software developers make to source code files, and concludes that a new composition of attentional and pointer network components provides the best overall performance and scalability.
Learning to Represent Programs with Graphs
TLDR
This work proposes to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures, and suggests that these models learn to infer meaningful names and to solve the VarMisuse task in many cases.
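
To make the graph idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation): a toy program graph mixes syntactic edges with a semantic last-use edge, and one round of mean-aggregation message passing updates the node embeddings.

```python
# Minimal sketch (not the paper's code): a toy program graph with syntactic
# and semantic edges, plus one round of mean-aggregation message passing.
import numpy as np

# Token nodes for the snippet `x = 1; y = x + 2`.
nodes = ["x", "=", "1", "y", "=", "x", "+", "2"]

# Edges: (source, target, type). "child" edges stand in for AST structure;
# "last_use" links the second occurrence of `x` back to its definition.
edges = [(0, 1, "child"), (1, 2, "child"), (3, 4, "child"),
         (4, 5, "child"), (5, 6, "child"), (6, 7, "child"),
         (5, 0, "last_use")]

rng = np.random.default_rng(0)
dim = 8
state = rng.normal(size=(len(nodes), dim))          # initial node embeddings
W = {t: rng.normal(size=(dim, dim)) for t in ("child", "last_use")}

# One propagation step: each node averages messages from incoming edges,
# where a message is the source state transformed by an edge-type matrix.
messages = np.zeros_like(state)
counts = np.zeros(len(nodes))
for src, dst, etype in edges:
    messages[dst] += state[src] @ W[etype]
    counts[dst] += 1
counts[counts == 0] = 1
state = np.tanh(state + messages / counts[:, None])

print(state.shape)   # (8, 8): one updated embedding per node
```

In graph models of this kind, several such propagation rounds are typically run (with gated updates in gated graph neural networks) before a task-specific readout, e.g. scoring candidate variable names.
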
Neural Code Comprehension: A Learnable Representation of Code Semantics
TLDR
A novel processing technique to learn code semantics is presented and applied to a variety of program analysis tasks, showing that even without fine-tuning, a single RNN architecture and fixed inst2vec embeddings outperform specialized approaches for performance prediction and algorithm classification from raw code.
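
As a rough illustration of the fixed-embedding-plus-RNN setup described above, the following hypothetical sketch (sizes and data are placeholders, not the paper's) looks up frozen statement embeddings and feeds the sequence through a single LSTM and a linear classifier; only the RNN and classifier would be trained.

```python
# Hedged sketch (not the paper's implementation): frozen statement embeddings
# feed a single recurrent layer, and a linear head classifies the program.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden, n_classes = 1000, 200, 128, 10

embedding = nn.Embedding(vocab_size, emb_dim)
embedding.weight.requires_grad_(False)        # embeddings stay fixed

rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
classifier = nn.Linear(hidden, n_classes)

# A batch of two "programs", each a sequence of statement ids (stand-ins for
# normalized IR statements in the real setting).
batch = torch.randint(0, vocab_size, (2, 30))

emb = embedding(batch)                        # (2, 30, emb_dim)
_, (h, _) = rnn(emb)                          # final hidden state: (1, 2, hidden)
logits = classifier(h[-1])                    # (2, n_classes)
print(logits.shape)
```
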
A Comparison of Code Embeddings and Beyond
TLDR
This paper systematically evaluates the performance of eight program representation learning models on three common tasks, where six models are based on abstract syntax trees and two models are based on plain text of source code, and applies a prediction attribution technique to find what elements are captured and responsible for the predictions in each task.
Machine Learning in Compilers
TLDR
The relationship between machine learning and compiler optimisation is described and the main concepts of features, models, training and deployment are introduced to provide a comprehensive survey and a road map for the wide variety of different research areas.
Machine Learning in Compiler Optimization
TLDR
The relationship between machine learning and compiler optimization is described and the main concepts of features, models, training, and deployment are introduced and a road map for the wide variety of different research areas is provided.
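
The features/models/training/deployment loop that both of these surveys describe can be illustrated with a small, invented example: static loop features train a model that a compiler could then query when deciding whether to unroll. The feature set, data, and labels below are made up for illustration only.

```python
# Illustrative sketch (not from either paper): hand-crafted static features of
# a loop train a model that the compiler then queries for an unrolling decision.
from sklearn.tree import DecisionTreeClassifier

# Features per loop: [trip_count, body_instructions, has_branch]
X = [[4, 10, 0], [128, 6, 0], [8, 80, 1], [256, 4, 0], [2, 50, 1]]
y = [1, 1, 0, 1, 0]            # 1 = unrolling helped on this loop

model = DecisionTreeClassifier(max_depth=2).fit(X, y)   # "training"

# "Deployment": the compiler extracts the same features from a new loop
# and asks the model which heuristic to apply.
new_loop = [[64, 8, 0]]
print("unroll" if model.predict(new_loop)[0] == 1 else "keep as is")
```
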
Commit2Vec: Learning Distributed Representations of Code Changes
TLDR
This work elaborates upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and adapts it to represent source code changes (i.e., commits).
...

References

SHOWING 1-10 OF 313 REFERENCES
Programming with "Big Code": Lessons, Techniques and Applications
TLDR
This paper summarizes some of the experiences and insights obtained by developing several probabilistic systems over the last few years, presents a prediction approach suitable as a starting point for building probabilistic tools, and discusses a practical framework implementing this approach, called Nice2Predict.
Learning to Represent Programs with Graphs
TLDR
This work proposes to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures, and suggests that these models learn to infer meaningful names and to solve the VarMisuse task in many cases.
Neural Sketch Learning for Conditional Program Generation
TLDR
This work trains a neural generator not on code but on program sketches, or models of program syntax that abstract out names and operations that do not generalize across programs, and shows that it can often predict the entire body of a method given just a few API calls or data types that appear in the method.
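
The notion of a program sketch, i.e., code with names and other non-generalizing details abstracted away while API calls and structure are kept, can be illustrated with a small, hypothetical abstraction pass (this is not the paper's abstraction function; the API whitelist and regexes are invented).

```python
# Hypothetical abstraction pass (not the paper's): keep API calls, keywords,
# and structure; replace other identifiers and literals with holes.
import re

API_CALLS = {"FileReader", "BufferedReader", "readLine", "close"}
KEYWORDS = {"new", "while", "null", "return", "try", "if", "else"}

def to_sketch(code):
    code = re.sub(r'"[^"]*"|\d+', "LIT", code)            # literals -> LIT
    def repl(match):
        tok = match.group(0)
        return tok if tok in API_CALLS | KEYWORDS or tok == "LIT" else "_"
    return re.sub(r"[A-Za-z_]\w*", repl, code)            # other names -> holes

body = ('BufferedReader br = new BufferedReader(new FileReader(path)); '
        'while (br.readLine() != null) { }')
print(to_sketch(body))
# -> BufferedReader _ = new BufferedReader(new FileReader(_));
#    while (_.readLine() != null) { }
```
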
Synthesizing benchmarks for predictive modeling
TLDR
The authors' generator for OpenCL programs, CLgen, is used to automatically synthesize thousands of programs, and it is shown that learning over these improves the performance of a state-of-the-art predictive model by 1.27x.
A Machine Learning Framework for Programming by Example
TLDR
It is shown how machine learning can be used to speed up this seemingly hopeless search problem, by learning weights that relate textual features describing the provided input-output examples to plausible sub-components of a program.
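
A hedged sketch of that idea: textual features of the input-output examples are scored against learned weights to rank candidate sub-components, and the synthesizer explores high-scoring components first. The feature names, operations, and weights below are invented stand-ins for learned values.

```python
# Hedged sketch: a linear model over textual features of the input-output
# examples scores candidate operations; the search expands high scorers first.
examples = [("John Smith", "J. Smith"), ("Ada Lovelace", "A. Lovelace")]

def features(pairs):
    return {
        "output_shorter": all(len(o) < len(i) for i, o in pairs),
        "dot_added": all("." in o and "." not in i for i, o in pairs),
        "case_changed": any(i.lower() == o for i, o in pairs),
    }

# In the real framework these weights would be learned from solved tasks.
weights = {
    "Abbreviate": {"output_shorter": 2.0, "dot_added": 1.5, "case_changed": -1.0},
    "Lowercase":  {"output_shorter": 0.2, "dot_added": -2.0, "case_changed": 3.0},
    "Identity":   {"output_shorter": -3.0, "dot_added": -1.0, "case_changed": -1.0},
}

feats = features(examples)
scores = {op: sum(w * feats[f] for f, w in ws.items()) for op, ws in weights.items()}
for op in sorted(scores, key=scores.get, reverse=True):
    print(op, scores[op])        # the synthesizer would try "Abbreviate" first
```
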
Toward Deep Learning Software Repositories
TLDR
This work motivates deep learning for software language modeling, highlighting fundamental differences between state-of-the-practice software language models and connectionist models, and proposes avenues for future work, where deep learning can be brought to bear to support model-based testing, improve software lexicons, and conceptualize software artifacts.
Structured Generative Models of Natural Source Code
TLDR
A family of generative models for NSC that have three key properties: first, they incorporate both sequential and hierarchical structure, second, they learn a distributed representation of source code elements, and third, they integrate closely with a compiler.
Neural Code Completion
TLDR
This paper explores the use of neural network techniques to automatically learn code completion from a large corpus of dynamically typed JavaScript code, shows different neural networks that leverage not only token-level information but also structural information, and evaluates their performance on different prediction tasks.
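
As a minimal illustration of learned code completion (not the paper's architecture, which also exploits structural information), the sketch below trains nothing but shows the shape of the approach: an LSTM over token embeddings produces a distribution over the next token.

```python
# Minimal sketch (untrained, token-level only): an LSTM over token embeddings
# produces a distribution over the next token, the core of learned completion.
import torch
import torch.nn as nn

vocab = ["var", "x", "=", "function", "(", ")", "{", "}", ";", "<unk>"]
tok2id = {t: i for i, t in enumerate(vocab)}

class Completion(nn.Module):
    def __init__(self, vocab_size, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, ids):
        h, _ = self.lstm(self.emb(ids))
        return self.out(h)                    # next-token logits at each position

model = Completion(len(vocab))
context = torch.tensor([[tok2id[t] for t in ("var", "x", "=")]])
logits = model(context)[0, -1]                # prediction after the last token
print(vocab[int(logits.argmax())])            # untrained, so effectively random
```
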
DeepCoder: Learning to Write Programs
TLDR
The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver.
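
A small sketch of that guidance idea, with an invented four-primitive DSL standing in for DeepCoder's: a predictor's per-primitive probabilities (hard-coded here) order an enumerative search over compositions, so programs built from likely primitives are tried first.

```python
# Small sketch of neural-guided enumeration with an invented four-primitive DSL.
# The "predicted" probabilities stand in for a network's output on the examples.
from itertools import product

PRIMITIVES = {
    "sort":    lambda xs: sorted(xs),
    "reverse": lambda xs: list(reversed(xs)),
    "head":    lambda xs: xs[:1],
    "tail":    lambda xs: xs[1:],
}

predicted_prob = {"sort": 0.9, "head": 0.8, "reverse": 0.1, "tail": 0.1}
examples = [([3, 1, 2], [1]), ([5, 4], [4])]   # behaviour: sort, then take head

def search(examples, max_depth=2):
    ranked = sorted(PRIMITIVES, key=predicted_prob.get, reverse=True)
    for depth in range(1, max_depth + 1):
        for prog in product(ranked, repeat=depth):   # likely primitives first
            outputs = []
            for xs, _ in examples:
                for op in prog:
                    xs = PRIMITIVES[op](xs)
                outputs.append(xs)
            if outputs == [out for _, out in examples]:
                return prog
    return None

print(search(examples))   # ('sort', 'head'): apply sort first, then head
```
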
A deep language model for software code
TLDR
This paper proposes a novel approach to building a language model for software code, built upon the deep learning-based Long Short-Term Memory architecture, which is capable of learning the long-term dependencies that occur frequently in software code.
...