A Survey of Machine Learning for Big Code and Naturalness
@article{Allamanis2018ASO, title={A Survey of Machine Learning for Big Code and Naturalness}, author={Miltiadis Allamanis and Earl T. Barr and Premkumar T. Devanbu and Charles Sutton}, journal={ACM Computing Surveys (CSUR)}, year={2018}, volume={51}, pages={1--37} }
Research at the intersection of machine learning, programming languages, and software engineering has recently taken important steps in proposing learnable probabilistic models of source code that exploit the abundance of patterns of code. We present a taxonomy based on the underlying design principles of each model and use it to navigate the literature. Then, we review how researchers have adapted these models to application areas and discuss cross-cutting and application-specific challenges…
491 Citations
ComPy-Learn: A toolbox for exploring machine learning representations for compilers
- Computer Science, 2020 Forum for Specification and Design Languages (FDL)
- 2020
ComPy-Learn, a toolbox for conveniently defining, extracting, and exploring representations of program code, is presented; it enables an efficient search for the best-performing representation and model for tasks on program code.
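As a rough illustration of the kind of experiment such a toolbox automates, the sketch below sweeps (representation, model) pairs and keeps the best-scoring pipeline. The identifiers are made up for illustration and are not ComPy-Learn's actual API.

```python
from itertools import product

# Hypothetical sweep over (representation, model) pairs -- the shape of the
# experiment such a toolbox automates, not ComPy-Learn's real API.
REPRESENTATIONS = ["token_seq", "ast_graph", "llvm_graph"]
MODELS = ["lstm", "gnn"]

def evaluate(rep, model):
    """Stand-in for: extract `rep` from the corpus, train `model`, score it."""
    return {("llvm_graph", "gnn"): 0.83}.get((rep, model), 0.70)

best = max(product(REPRESENTATIONS, MODELS), key=lambda rm: evaluate(*rm))
print("best pipeline:", best)  # -> ('llvm_graph', 'gnn')
```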
Neural Networks for Modeling Source Code Edits
- Computer Science, ArXiv
- 2019
This work treats source code as a dynamic object, tackles the problem of modeling the edits that software developers make to source code files, and concludes that a new composition of attentional and pointer network components provides the best overall performance and scalability.
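The pointer-network component mentioned above can be illustrated with a minimal sketch: score each encoder position against the current decoder state and softmax the scores into a distribution over positions to edit. PyTorch is assumed, and all names are illustrative rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class PointerHead(nn.Module):
    """Scores each encoder position against the decoder state and returns
    a distribution over positions in the original file (illustrative)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, encoder_states, decoder_state):
        # encoder_states: (seq_len, hidden); decoder_state: (hidden,)
        query = self.query_proj(decoder_state)   # (hidden,)
        scores = encoder_states @ query          # (seq_len,)
        return torch.softmax(scores, dim=0)      # where to apply the edit

encoder_states = torch.randn(20, 64)             # 20 tokens of the old file
pointer = PointerHead(64)
print(pointer(encoder_states, torch.randn(64)).argmax())  # chosen position
```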
Learning to Represent Programs with Graphs
- Computer Science, ICLR
- 2018
This work proposes to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures, and suggests that these models learn to infer meaningful names and to solve the VarMisuse task in many cases.
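A toy version of such a program graph can make the idea concrete: nodes for syntax elements, typed edges for syntactic and semantic relations, and message passing along each edge type. The node, edge, and weight names below are illustrative, not the paper's exact schema.

```python
import collections

# Toy program graph for `x = y; print(x)` with typed edges, in the spirit
# of the paper's representation (node/edge names illustrative).
nodes = ["Assign", "x", "y", "print_call", "x_use"]
edges = [
    ("Assign", "x", "AST_CHILD"),
    ("Assign", "y", "AST_CHILD"),
    ("x", "y", "NEXT_TOKEN"),
    ("x_use", "x", "LAST_USE"),   # dataflow: the later read of x
]

# One round of typed message passing (scalar "embeddings" for brevity).
state = {n: 1.0 for n in nodes}
weight = {"AST_CHILD": 0.5, "NEXT_TOKEN": 0.3, "LAST_USE": 0.8}
incoming = collections.defaultdict(float)
for src, dst, etype in edges:
    incoming[dst] += weight[etype] * state[src]
state = {n: state[n] + incoming[n] for n in nodes}
print(state)
```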
Neural Code Comprehension: A Learnable Representation of Code Semantics
- Computer Science, NeurIPS
- 2018
A novel processing technique to learn code semantics is presented and applied to a variety of program analysis tasks; even without fine-tuning, a single RNN architecture with fixed inst2vec embeddings outperforms specialized approaches for performance prediction and algorithm classification from raw code.
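A minimal sketch of the "fixed embeddings plus a single RNN" setup, assuming PyTorch: pretrained statement embeddings are frozen, an LSTM reads the embedded instruction sequence, and a linear head classifies the program. The random tensor stands in for the actual inst2vec vectors, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

VOCAB, DIM, CLASSES = 1000, 200, 10
pretrained = torch.randn(VOCAB, DIM)  # stand-in for the inst2vec vectors

class IRClassifier(nn.Module):
    """Frozen statement embeddings -> single LSTM -> task head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.rnn = nn.LSTM(DIM, 128, batch_first=True)
        self.head = nn.Linear(128, CLASSES)

    def forward(self, ids):                  # ids: (batch, seq_len)
        _, (h, _) = self.rnn(self.embed(ids))
        return self.head(h[-1])              # one logit vector per program

logits = IRClassifier()(torch.randint(0, VOCAB, (4, 50)))
print(logits.shape)                          # torch.Size([4, 10])
```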
A Comparison of Code Embeddings and Beyond
- Computer Science, ArXiv
- 2021
This paper systematically evaluates the performance of eight program representation learning models on three common tasks, where six models are based on abstract syntax trees and two models are based on the plain text of source code, and applies a prediction attribution technique to find which elements are captured and responsible for the predictions in each task.
Machine Learning in Compilers
- Computer Science
- 2018
The relationship between machine learning and compiler optimisation is described, and the main concepts of features, models, training, and deployment are introduced, providing a comprehensive survey and a road map for the wide variety of different research areas.
Machine Learning in Compiler Optimization
- Computer Science, Proceedings of the IEEE
- 2018
The relationship between machine learning and compiler optimization is described and the main concepts of features, models, training, and deployment are introduced and a road map for the wide variety of different research areas is provided.
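The features-models-training-deployment loop can be sketched in a few lines, assuming scikit-learn. The features and labels below are hypothetical; in practice they would be measured program characteristics and the best optimization decision found offline.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per program: [num_instructions, num_branches, loop_depth]
X_train = [[120, 10, 1], [5000, 400, 3], [80, 2, 0]]
y_train = ["O1", "O3", "O0"]   # best optimization level found offline

model = DecisionTreeClassifier().fit(X_train, y_train)

def choose_opt_level(features):
    """Deployment: the compiler queries the trained model per program."""
    return model.predict([features])[0]

print(choose_opt_level([3000, 250, 2]))  # likely "O3"
```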
Cnerator: A Python application for the controlled stochastic generation of standard C source code
- Computer Science, SoftwareX
- 2021
Commit2Vec: Learning Distributed Representations of Code Changes
- Computer Science, SN Comput. Sci.
- 2021
This work elaborates upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and adapts it to represent source code changes (i.e., commits).
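A minimal sketch of representing a commit via syntactic structure, loosely in the spirit of path-context representations: diff the bags of (terminal, path, terminal) triples extracted from the pre- and post-change ASTs. The triples below are hand-written toy examples, not output of the paper's pipeline.

```python
# Hand-written toy path-contexts (terminal, syntactic-path, terminal)
# for the change `if x == 0:` -> `if x == 1:` (illustrative only).
pre_contexts = [("x", "Name^Compare_Constant", "0")]
post_contexts = [("x", "Name^Compare_Constant", "1")]

def commit_representation(pre, post):
    """A commit as the removed and added path-contexts; a downstream model
    would embed each triple and pool them into one commit vector."""
    return {"removed": set(pre) - set(post), "added": set(post) - set(pre)}

print(commit_representation(pre_contexts, post_contexts))
```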
References
Showing 1-10 of 313 references
Programming with "Big Code": Lessons, Techniques and Applications
- Computer Science, SNAPL
- 2015
This paper summarizes some of the experiences and insights obtained by developing several probabilistic systems over the last few years, presents a prediction approach suitable as a starting point for building probabilistic tools, and discusses a practical framework implementing this approach, called Nice2Predict.
Learning to Represent Programs with Graphs
- Computer Science, ICLR
- 2018
This work proposes to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures, and suggests that these models learn to infer meaningful names and to solve the VarMisuse task in many cases.
Neural Sketch Learning for Conditional Program Generation
- Computer Science, ICLR
- 2018
This work trains a neural generator not on code but on program sketches, or models of program syntax that abstract out names and operations that do not generalize across programs, and shows that it can often predict the entire body of a method given just a few API calls or data types that appear in the method.
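The idea of abstracting a program into a sketch can be illustrated on Python syntax (the paper targets Java): replace concrete names and literals with placeholders, keeping only the syntactic skeleton. A minimal sketch, assuming Python 3.9+ for ast.unparse:

```python
import ast

class Sketcher(ast.NodeTransformer):
    """Replace identifiers and literals with placeholders, leaving only
    the syntactic skeleton of the code."""
    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_VAR", ctx=node.ctx), node)

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value="_LIT"), node)

tree = Sketcher().visit(ast.parse("total = price * 1.2"))
print(ast.unparse(ast.fix_missing_locations(tree)))  # _VAR = _VAR * '_LIT'
```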
Synthesizing benchmarks for predictive modeling
- Computer Science, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
- 2017
The authors' generator for OpenCL programs, CLgen, is used to automatically synthesize thousands of programs, and it is shown that learning over these improves the performance of a state-of-the-art predictive model by 1.27x.
A Machine Learning Framework for Programming by Example
- Computer Science, ICML
- 2013
It is shown how machine learning can be used to speed up this seemingly hopeless search problem, by learning weights that relate textual features describing the provided input-output examples to plausible sub-components of a program.
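A minimal sketch of the scoring idea: learned weights relate textual features of the input-output examples to DSL components, and the search tries high-scoring components first. The weights, features, and component names below are invented for illustration.

```python
# Invented weights relating example features to DSL components.
WEIGHTS = {
    ("has_space", "Split"): 2.0,
    ("case_changed", "Upper"): 1.5,
    ("has_space", "Upper"): 0.1,
}

def rank_components(example_features, components):
    """Order DSL components so the search explores plausible ones first."""
    def score(c):
        return sum(WEIGHTS.get((f, c), 0.0) for f in example_features)
    return sorted(components, key=score, reverse=True)

features = {"has_space", "case_changed"}       # from ("john doe" -> "JOHN")
print(rank_components(features, ["Split", "Upper", "Concat"]))
```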
Toward Deep Learning Software Repositories
- Computer Science, 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories
- 2015
This work motivates deep learning for software language modeling, highlights fundamental differences between state-of-the-practice software language models and connectionist models, and proposes avenues for future work where deep learning can be brought to bear to support model-based testing, improve software lexicons, and conceptualize software artifacts.
Structured Generative Models of Natural Source Code
- Computer Science, ICML
- 2014
A family of generative models for NSC that have three key properties: first, they incorporate both sequential and hierarchical structure, second, they learn a distributed representation of source code elements, and third, they integrate closely with a compiler.
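The hierarchical part of such generative models can be illustrated with a tiny probabilistic grammar; the sketch below samples expressions top-down. It omits the sequential context and learned distributed representations that the paper combines with this structure.

```python
import random

# Tiny probabilistic grammar over expressions (hierarchical structure only).
RULES = {
    "Expr": [(0.4, ["Expr", "+", "Expr"]), (0.6, ["Num"])],
    "Num": [(1.0, ["1"])],
}

def sample(symbol):
    if symbol not in RULES:          # terminal symbol
        return symbol
    r, acc = random.random(), 0.0
    for prob, rhs in RULES[symbol]:
        acc += prob
        if r <= acc:
            return " ".join(sample(s) for s in rhs)

print(sample("Expr"))  # e.g. "1 + 1 + 1"
```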
Neural Code Completion
- Computer Science
- 2017
This paper explores the use of neural network techniques to automatically learn code completion from a large corpus of dynamically typed JavaScript code, presents different neural networks that leverage not only token-level information but also structural information, and evaluates their performance on different prediction tasks.
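A minimal token-level completion model, assuming PyTorch: embed the prefix, run an LSTM, and predict the next token id. The structural (AST-aware) variants the paper evaluates are omitted here, and all sizes are arbitrary.

```python
import torch
import torch.nn as nn

class Completer(nn.Module):
    """Predict the next token id from a prefix (token-level variant only)."""
    def __init__(self, vocab=500, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, prefix_ids):           # (batch, seq_len)
        h, _ = self.rnn(self.embed(prefix_ids))
        return self.out(h[:, -1])            # logits for the next token

prefix = torch.randint(0, 500, (1, 12))      # an encoded 12-token prefix
print(Completer()(prefix).argmax(dim=-1))    # suggested next token id
```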
DeepCoder: Learning to Write Programs
- Computer Science, ICLR
- 2017
The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver.
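The overall recipe can be sketched as: a learned model predicts which DSL operations likely appear, and enumerative search tries likely programs first, checking them against the input-output examples. Below, predict_op_probs is a hard-coded stand-in for the trained network, and the three-operation DSL is invented for illustration.

```python
from itertools import product

DSL = {"reverse": lambda xs: xs[::-1],
       "sort": sorted,
       "double": lambda xs: [2 * x for x in xs]}

def predict_op_probs(examples):
    """Hard-coded stand-in for the trained property-prediction network."""
    return {"sort": 0.7, "double": 0.6, "reverse": 0.1}

def run(prog, xs):
    for op in prog:
        xs = DSL[op](xs)
    return xs

def search(examples, depth=2):
    probs = predict_op_probs(examples)
    ops = sorted(DSL, key=probs.get, reverse=True)   # likely ops first
    for prog in product(ops, repeat=depth):
        if all(run(prog, i) == o for i, o in examples):
            return prog

print(search([([3, 1], [2, 6])]))  # -> ('sort', 'double')
```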
A deep language model for software code
- Computer Science, FSE 2016
- 2016
This paper proposes a novel approach to building a language model for software code upon the deep learning-based Long Short-Term Memory architecture, which is capable of learning the long-term dependencies that occur frequently in software code.