Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

@article{Karampatsis2020BigC,
  title={Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code},
  author={Rafael-Michael Karampatsis and Hlib Babii and Romain Robbes and Charles Sutton and Andrea Janes},
  journal={2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE)},
  year={2020},
  pages={1073-1085}
}
Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, readability improvement, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source…
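
The paper's central technique is keeping the vocabulary open by modeling code as subword units learned with byte-pair encoding (BPE), so that rare or previously unseen identifiers decompose into known pieces. As a rough illustration of that idea only, here is a minimal, self-contained Python sketch of BPE merge learning and segmentation; the toy corpus, merge count, and helper names (learn_bpe_merges, segment) are illustrative assumptions, not the authors' implementation:

```python
from collections import Counter

def learn_bpe_merges(tokens, num_merges):
    """Learn byte-pair encoding merges from a corpus of identifiers/tokens.

    Every token starts as a sequence of characters plus an end-of-word
    marker; the most frequent adjacent symbol pair is merged repeatedly.
    """
    vocab = Counter(tuple(t) + ("</w>",) for t in tokens)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every entry in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(token, merges):
    """Split a (possibly unseen) identifier into learned subword units."""
    symbols = list(token) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Toy corpus of identifiers; the paper trains on a far larger code corpus.
corpus = ["getFile", "getFileName", "setFileName", "readFile", "fileReader"] * 20
merges = learn_bpe_merges(corpus, num_merges=40)
# An identifier never seen during training still maps onto learned subwords
# (or falls back to single characters), so no token is out of vocabulary.
print(segment("getFileReader", merges))
```

Even an identifier absent from the training corpus, such as getFileReader in the example, is segmented into units the model has statistics for, which is what lets a fixed subword vocabulary cover an ever-growing identifier space.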

Open-vocabulary models for source code
TLDR
This paper presents an open-vocabulary source code NLM that can scale to a corpus 100 times larger than in previous work and outperforms the state of the art.
Open-Vocabulary Models for Source Code (Extended Abstract)
TLDR
This paper presents an open-vocabulary source code NLM that can scale to a corpus 100 times larger than in previous work and outperforms the state of the art.
A Systematic Evaluation of Large Language Models of Code
TLDR
Existing open-source models, although targeted mainly at natural language modeling, are found to achieve close results in some programming languages; the paper also introduces PolyCoder, a new 2.7B-parameter model based on the GPT-2 architecture, trained on 249 GB of code across 12 programming languages on a single machine.
Can Identifier Splitting Improve Open-Vocabulary Language Model of Code?
TLDR
This paper proposes splitting identifiers both when constructing the vocabulary and when processing model inputs, exploring three settings for applying identifier splitting to language models on the code completion task. It finds that simply inserting identifier splitting into the pipeline hurts model performance, while a hybrid strategy combining identifier splitting with the BPE algorithm can outperform the original open-vocabulary models at predicting identifiers.
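
For context on the entry above, identifier splitting conventionally means breaking compound names at underscores, digits, and camelCase boundaries before (or instead of) subword tokenization. Below is a minimal sketch of that heuristic, assuming a simple regex-based rule rather than the cited paper's exact procedure:

```python
import re

def split_identifier(name):
    """Split an identifier into subtokens on underscores, digits, and
    camelCase boundaries (a common heuristic, not the cited paper's rule)."""
    parts = re.split(r"[_\W]+", name)
    subtokens = []
    for part in parts:
        # Acronym runs, capitalized words, lowercase runs, and digit runs.
        subtokens += re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", part)
    return [t.lower() for t in subtokens if t]

print(split_identifier("parseHTTPResponse_v2"))  # ['parse', 'http', 'response', 'v', '2']
```
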
Code to Comment “Translation”: Data, Metrics, Baselining & Evaluation
TLDR
It is argued that fairly naive information retrieval methods do well enough at this task to serve as a reasonable baseline, and suggestions are offered on how the findings might be used in future research in this area.
NaturalCC: A Toolkit to Naturalize the Source Code Corpus
TLDR
NaturalCC is an efficient and extensible toolkit, built upon Fairseq and PyTorch, that bridges the gap between natural language and programming language and facilitates research on big code analysis.
Function completion in the time of massive data: A code embedding perspective
TLDR
This work presents a novel approach for improving current function-call completion tools by learning from independent code repositories, using well-known natural language processing models that learn vector representations of source code (code embeddings).
Multi-task Learning based Pre-trained Language Model for Code Completion
TLDR
A multi-task learning based pre-trained language model for code understanding and code generation, built on a Transformer architecture, that predicts each token and its type jointly and uses the predicted type to assist token prediction.
ReACC: A Retrieval-Augmented Code Completion Framework
TLDR
This work proposes a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of code with similar semantics, and adopts a stage-wise training approach combining a source code retriever with an autoregressive language model for programming languages.
A Novel Self-Attention Based Automatic Code Completion Neural Network
TLDR
A novel automatic code completion neural network based on a self-attention mechanism with an open vocabulary is presented, addressing out-of-vocabulary issues, slow training speed, and the lack of long-range context dependency.

References

Showing 1-10 of 100 references
Maybe Deep Neural Networks are the Best Choice for Modeling Source Code
TLDR
This work presents a new open-vocabulary neural language model for code that is not limited to a fixed vocabulary of identifier names and achieves best-in-class performance, outperforming even the state-of-the-art methods of Hellendoorn and Devanbu that are designed specifically to model code.
Modeling Vocabulary for Big Code Machine Learning
TLDR
A subset of vocabulary-modeling decisions is shown to have decisive characteristics, allowing accurate Neural Language Models to be trained quickly on a large corpus of 10,106 projects.
Are deep neural networks the best choice for modeling source code?
TLDR
This work enhances established language modeling approaches to handle the special challenges of modeling source code, such as frequent changes, larger and changing vocabularies, and deeply nested scopes, and presents a fast, nested language modeling toolkit specifically designed for software.
Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow
TLDR
A novel method to mine high-quality aligned data from Stack Overflow using two sets of features: hand-crafted features that consider the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model with neural networks to capture the correlation between natural language and code.
Mining source code repositories at massive scale using language modeling
TLDR
This paper builds the first giga-token probabilistic language model of source code, based on 352 million lines of Java, and proposes new metrics that measure the complexity of a code module and the topical centrality of a module to a software project.
Lexical statistical machine translation for language migration
TLDR
This paper treats source code as a sequence of lexical tokens and applies a phrase-based SMT model to the lexemes of those tokens; evaluation shows that a high percentage of the translated methods are syntactically incorrect.
A deep language model for software code
TLDR
This paper proposes a language model for software code built upon the deep learning-based Long Short-Term Memory architecture, which is capable of learning the long-term dependencies that occur frequently in software code.
Summarizing Source Code using a Neural Attention Model
TLDR
This paper presents the first completely data-driven approach for generating high-level summaries of source code, using Long Short-Term Memory (LSTM) networks with attention to produce sentences that describe C# code snippets and SQL queries.
A statistical semantic language model for source code
TLDR
SLAMC is introduced, a novel statistical semantic language model for source code that incorporates semantic information into code tokens and models the regularities/patterns of such semantic annotations, called sememes, rather than their lexemes.
code2seq: Generating Sequences from Structured Representations of Code
TLDR
This model represents a code snippet as the set of compositional paths in its abstract syntax tree and uses attention to select the relevant paths while decoding; it significantly outperforms previous models that were specifically designed for programming languages, as well as state-of-the-art NMT models.