Corpus ID: 237572201

Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy

@inproceedings{Clement2021LongRangeMO,
  title={Long-Range Modeling of Source Code Files with eWASH: Extended Window Access by Syntax Hierarchy},
  author={Colin Clement and Shuai Lu and Xiaoyu Liu and Michele Tufano and Dawn Drain and Nan Duan and Neel Sundaresan and Alexey Svyatkovskiy},
  booktitle={EMNLP},
  year={2021}
}
Statistical language modeling and translation with transformers have found many successful applications in program understanding and generation tasks, setting high benchmarks for tools in modern software development environments. The finite context window of these neural models means, however, that they will be unable to leverage the entire relevant context of large files and packages for any given task. While there are many efforts to extend the context window, we introduce an…

References

code2seq: Generating Sequences from Structured Representations of Code
TLDR
This model represents a code snippet as the set of compositional paths in its abstract syntax tree and uses attention to select the relevant paths while decoding. It significantly outperforms previous models that were specifically designed for programming languages, as well as state-of-the-art NMT models.
PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers
TLDR
This work introduces PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style.
Code completion with statistical language models
TLDR
The main idea is to reduce the problem of code completion to a natural-language processing problem of predicting probabilities of sentences, and to design a simple and scalable static analysis that extracts sequences of method calls from a large codebase and indexes these into a statistical language model.
Pythia: AI-assisted Code Completion System
TLDR
The architecture of the Pythia system is described, comparisons to a frequency-based approach and an invocation-based Markov Chain language model are performed, and challenges of serving Pythia models on lightweight client devices are discussed.
Probabilistic model for code with decision trees
TLDR
The key idea is to phrase the problem of learning a probabilistic model of code as learning a decision tree in a domain specific language over abstract syntax trees (called TGen), which allows us to condition the prediction of a program element on a dynamically computed context.
Generating Accurate Assert Statements for Unit Test Cases using Pretrained Transformers
TLDR
This approach is based on a state-of-the-art transformer model initially pretrained on an English textual corpus, then trained in a semi-supervised fashion on a large corpus of source code, and finally finetuned on the task of generating assert statements for unit tests.
Improving Automatic Source Code Summarization via Deep Reinforcement Learning
  • Yao Wan, Zhou Zhao, +4 authors Philip S. Yu
  • Computer Science
  • 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE)
  • 2018
TLDR
This work incorporates an abstract syntax tree structure as well as the sequential content of code snippets into a deep reinforcement learning framework (i.e., an actor-critic network), which provides the confidence of predicting the next word according to the current state, and uses an advantage reward composed of the BLEU metric to train both networks.
Automatic generation of natural language summaries for Java classes
TLDR
This paper presents a technique to automatically generate human readable summaries for Java classes, assuming no documentation exists, and determines that they are readable and understandable, they do not include extraneous information, and, in most cases, they are not missing essential information.
CPC: Automatically Classifying and Propagating Natural Language Comments via Program Analysis
TLDR
A comprehensive taxonomy of comments is built, and program analysis is proposed to systematically derive, refine, and propagate comments, which can be used to detect new bugs in large open source projects.
Learning from examples to improve code completion systems
TLDR
Evidence is given that intelligent code completion systems which learn from examples significantly outperform mainstream code completion systems in terms of the relevance of their suggestions and thus have the potential to enhance developers' productivity.