Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy

Colin Clement, Shuai Lu, Xiaoyu Liu, Michele Tufano, Dawn Drain, Nan Duan, Neel Sundaresan, Alexey Svyatkovskiy

Statistical language modeling and translation with transformers have found many successful applications in program understanding and generation tasks, setting high benchmarks for tools in modern software development environments. The finite context window of these neural models means, however, that they will be unable to leverage the entire relevant context of large files and packages for any given task. While there are many efforts to extend the context window, we introduce an… 

code2seq: Generating Sequences from Structured Representations of Code
This model represents a code snippet as the set of compositional paths in its abstract syntax tree and uses attention to select the relevant paths while decoding. It significantly outperforms previous models that were specifically designed for programming languages, as well as state-of-the-art NMT models.
PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers
This work introduces PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style.
Code completion with statistical language models
The main idea is to reduce the problem of code completion to a natural-language processing problem of predicting probabilities of sentences, and to design a simple and scalable static analysis that extracts sequences of method calls from a large codebase and indexes them into a statistical language model.
Pythia: AI-assisted Code Completion System
The architecture of the Pythia system is described, comparisons to a frequency-based approach and an invocation-based Markov chain language model are performed, and the challenges of serving Pythia models on lightweight client devices are discussed.
Probabilistic model for code with decision trees
The key idea is to phrase the problem of learning a probabilistic model of code as learning a decision tree in a domain-specific language over abstract syntax trees (called TGen), which allows the prediction of a program element to be conditioned on a dynamically computed context.
Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers
This approach is based on a state-of-the-art transformer model initially pretrained on an English textual corpus, which is then trained in a semi-supervised fashion on a large corpus of source code and finally finetuned on the task of generating assert statements for unit tests.
Improving Automatic Source Code Summarization via Deep Reinforcement Learning
Yao Wan, Zhou Zhao, +4 authors, Philip S. Yu. 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), 2018.
This work incorporates an abstract syntax tree structure as well as the sequential content of code snippets into a deep reinforcement learning framework (i.e., an actor-critic network), in which the actor network provides the confidence of predicting the next word according to the current state, and an advantage reward composed of the BLEU metric is used to train both networks.
Automatic generation of natural language summaries for Java classes
This paper presents a technique to automatically generate human-readable summaries for Java classes, assuming no documentation exists. An evaluation of the summaries determines that they are readable and understandable, that they do not include extraneous information, and that, in most cases, they are not missing essential information.
CPC: Automatically Classifying and Propagating Natural Language Comments via Program Analysis
A comprehensive taxonomy of comments is built, and program analysis is used to systematically derive, refine, and propagate comments; the propagated comments are then used to detect new bugs in large open-source projects.
Learning from examples to improve code completion systems
Evidence is given that intelligent code completion systems which learn from examples significantly outperform mainstream code completion systems in terms of the relevance of their suggestions and thus have the potential to enhance developers' productivity.