code2vec: learning distributed representations of code

@article{Alon2019code2vecLD,
  title={code2vec: learning distributed representations of code},
  author={Uri Alon and Meital Zilberstein and Omer Levy and Eran Yahav},
  journal={Proceedings of the ACM on Programming Languages},
  year={2019},
  volume={3},
  pages={1--29}
}
We present a neural model for representing snippets of code as continuous distributed vectors ("code embeddings"). [...] To this end, code is first decomposed into a collection of paths in its abstract syntax tree. Then, the network learns the atomic representation of each path while simultaneously learning how to aggregate a set of them. We demonstrate the effectiveness of our approach by using it to predict a method's name from the vector representation of its body. We evaluate our approach by…
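As a rough, runnable sketch of the decomposition this abstract describes (path extraction over Python's built-in ast module; the helper names leaf_nodes and path_contexts are invented here for illustration, and the actual code2vec implementation targets Java ASTs):

```python
import ast
import itertools

def leaf_nodes(node, path=()):
    """Yield (leaf_token, root-to-leaf node types) for every AST leaf."""
    here = path + (type(node).__name__,)
    children = [c for c in ast.iter_child_nodes(node)
                if not isinstance(c, ast.expr_context)]  # skip Load/Store markers
    if not children:
        token = getattr(node, "id", None) or getattr(node, "arg", None) \
            or type(node).__name__
        yield str(token), here
    for child in children:
        yield from leaf_nodes(child, here)

def path_contexts(source):
    """Return (leaf, connecting AST path, leaf) triples for all leaf pairs."""
    leaves = list(leaf_nodes(ast.parse(source)))
    triples = []
    for (tok_a, pa), (tok_b, pb) in itertools.combinations(leaves, 2):
        i = 0  # length of the shared root prefix
        while i < min(len(pa), len(pb)) and pa[i] == pb[i]:
            i += 1
        # Walk up from leaf a to the lowest common ancestor, then down to leaf b.
        connecting = pa[i:][::-1] + pa[i - 1:i] + pb[i:]
        triples.append((tok_a, "^".join(connecting), tok_b))
    return triples

for ctx in path_contexts("def add(a, b):\n    return a + b")[:3]:
    print(ctx)
```

In the model, each such (leaf, path, leaf) triple is embedded, and a learned attention mechanism aggregates the set of triples into a single code vector from which the method's name is predicted.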
When Deep Learning Met Code Search
There have been multiple recent proposals on using deep neural networks for code search using natural language. Common across these proposals is the idea of embedding code and natural language…
Neural Code Comprehension: A Learnable Representation of Code Semantics
With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or…
Commit2Vec: Learning Distributed Representations of Code Changes
TLDR
This work elaborates upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and adapts it to represent source changes (i.e., commits).
When deep learning met code search
TLDR
This paper assembled implementations of state-of-the-art techniques to run on a common platform with shared training and evaluation corpora, and introduced a new design point: a minimal-supervision extension to an existing unsupervised technique.
code2seq: Generating Sequences from Structured Representations of Code
TLDR
This model represents a code snippet as the set of compositional paths in its abstract syntax tree and uses attention to select the relevant paths while decoding, and it significantly outperforms previous models that were specifically designed for programming languages, as well as state-of-the-art NMT models.
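As a generic illustration of the attention-based path selection mentioned above (a plain numpy sketch of one decoding step, not the code2seq implementation; attend, paths, and state are hypothetical names):

```python
import numpy as np

def attend(decoder_state, path_encodings):
    """One attention step: score every encoded path against the current
    decoder state, softmax-normalize, and return the weighted context
    vector that conditions the next token of the generated sequence."""
    scores = path_encodings @ decoder_state          # (num_paths,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # attention distribution
    return weights @ path_encodings                  # context vector, shape (dim,)

rng = np.random.default_rng(0)
paths = rng.normal(size=(6, 8))   # 6 encoded AST paths, dimension 8
state = rng.normal(size=8)        # current decoder hidden state
print(attend(state, paths))
```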
Neural Code Comprehension: A Learnable Representation of Code Semantics
TLDR
A novel processing technique to learn code semantics is presented and applied to a variety of program analysis tasks; even without fine-tuning, a single RNN architecture with fixed inst2vec embeddings outperforms specialized approaches for performance prediction and algorithm classification from raw code.
A Source Code Similarity Based on Siamese Neural Network
TLDR
A Siamese Neural Network is proposed that maps code into a continuous vector space, aiming to capture its semantic meaning, and it improves performance over a single word-embedding method.
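A loose sketch of a Siamese arrangement like the one summarized above, under the simplifying assumption that both branches share one mean-pooling encoder (everything here is illustrative, not the paper's model):

```python
import numpy as np

def encode(token_ids, embedding_table):
    """Shared encoder branch: average the token embeddings of a snippet.
    (A real Siamese model would use a learned network; mean-pooling is a
    stand-in so the sketch stays self-contained.)"""
    return embedding_table[token_ids].mean(axis=0)

def similarity(code_a, code_b, embedding_table):
    """Cosine similarity between the two branch outputs."""
    va = encode(code_a, embedding_table)
    vb = encode(code_b, embedding_table)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

rng = np.random.default_rng(0)
table = rng.normal(size=(100, 16))     # vocabulary of 100 tokens, dimension 16
print(similarity([3, 14, 15], [3, 14, 9], table))
```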
Code vectors: understanding programs through embedded abstracted symbolic traces
TLDR
This paper uses abstractions of traces obtained from symbolic execution of a program as a representation for learning word embeddings and shows that embeddings learned from semantic abstractions provide nearly triple the accuracy of those learned from syntactic abstractions.
patch2vec: Distributed Representation of Code Changes
TLDR
This work elaborates upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and adapts it to represent source changes (i.e., commits).
Commit2Vec: Learning Distributed Representations of Code Changes [PRE-PRINT]
Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the…

References

Showing 1-10 of 119 references
Learning to Represent Programs with Graphs
TLDR
This work proposes to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures, and suggests that these models learn to infer meaningful names and to solve the VarMisuse task in many cases.
A general path-based representation for predicting program properties
TLDR
A general path-based representation that represents a program using paths in its abstract syntax tree (AST), allowing a learning model to leverage the structured nature of code rather than treating it as a flat sequence of tokens.
Summarizing Source Code using a Neural Attention Model
TLDR
This paper presents the first completely data-driven approach for generating high-level summaries of source code, which uses Long Short Term Memory (LSTM) networks with attention to produce sentences that describe C# code snippets and SQL queries.
Toward Deep Learning Software Repositories
TLDR
This work motivates deep learning for software language modeling, highlighting fundamental differences between state-of-the-practice software language models and connectionist models, and proposes avenues for future work where deep learning can be brought to bear to support model-based testing, improve software lexicons, and conceptualize software artifacts.
Suggesting accurate method and class names
TLDR
A neural probabilistic language model for source code that is specifically designed for the method naming problem is introduced, along with a variant of the model that is, to the authors' knowledge, the first that can propose neologisms, names that have not appeared in the training corpus.
Neural Sketch Learning for Conditional Program Generation
TLDR
This work trains a neural generator not on code but on program sketches, or models of program syntax that abstract out names and operations that do not generalize across programs, and shows that it can often predict the entire body of a method given just a few API calls or data types that appear in the method.
A Convolutional Attention Network for Extreme Summarization of Source Code
TLDR
An attentional neural network is introduced that employs convolution on the input tokens to detect local time-invariant and long-range topical attention features in a context-dependent way, in order to solve the problem of extreme summarization of source code snippets into short, descriptive, function-name-like summaries.
Predicting Program Properties from "Big Code"
TLDR
This work formulates the problem of inferring program properties as structured prediction and shows how to perform both learning and inference in this context, opening up new possibilities for attacking a wide range of difficult problems in the context of "Big Code", including invariant generation, decompilation, synthesis, and others.
Leveraging a corpus of natural language descriptions for program similarity
TLDR
The approach can determine semantic relatedness and similarity of code across different libraries and even across different programming languages, a task considered extremely difficult using traditional approaches.
Learning programs from noisy data
TLDR
A novel regularized bitstream synthesizer is introduced that successfully generates programs even in the presence of incorrect examples and can detect errors in the examples while combating overfitting -- a major problem in existing synthesis techniques.