CodeBERT: A Pre-Trained Model for Programming and Natural Languages

@inproceedings{Feng2020CodeBERTAP,
  title={CodeBERT: A Pre-Trained Model for Programming and Natural Languages},
  author={Zhangyin Feng and Daya Guo and Duyu Tang and Nan Duan and Xiaocheng Feng and Ming Gong and Linjun Shou and Bing Qin and Ting Liu and Daxin Jiang and Ming Zhou},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2020},
  year={2020}
}
We present CodeBERT, a bimodal pre-trained model for programming language (PL) and natural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language code search and code documentation generation. We develop CodeBERT with a Transformer-based neural architecture and train it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators.
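To make the replaced-token-detection part of that hybrid objective concrete, here is a minimal PyTorch sketch of one training step: a small generator proposes plausible alternatives at masked positions of the NL-PL token sequence, and the discriminator (the model being pre-trained) labels every position as original or replaced. The `generator`/`discriminator` interfaces, tensor shapes, and masking policy are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def rtd_step(generator, discriminator, tokens, mask_positions, mask_token_id):
    """One replaced-token-detection step (sketch).

    tokens:         (batch, seq_len) token ids of the NL-PL sequence
    mask_positions: (batch, seq_len) bool tensor marking positions masked
                    out for the generator
    """
    # 1. Generator sees the masked sequence and proposes plausible alternatives.
    masked_input = tokens.clone()
    masked_input[mask_positions] = mask_token_id
    gen_logits = generator(masked_input)                   # (B, L, vocab)
    sampled = torch.distributions.Categorical(
        logits=gen_logits[mask_positions]).sample()        # (num_masked,)

    # 2. Build the corrupted sequence the discriminator actually sees.
    corrupted = tokens.clone()
    corrupted[mask_positions] = sampled

    # 3. A position is "replaced" if the sampled token differs from the original.
    replaced = (corrupted != tokens).float()               # (B, L)

    # 4. Discriminator scores every position as original vs. replaced.
    disc_logits = discriminator(corrupted)                 # (B, L)
    return F.binary_cross_entropy_with_logits(disc_logits, replaced)
```

Because the per-position labels come only from comparing token ids, the same loss applies to bimodal NL-PL pairs and to unimodal code.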
Citations

Learning and Evaluating Contextual Embedding of Source Code
TLDR: This paper curates a massive, deduplicated corpus of 7.4M Python files from GitHub and creates an open-sourced benchmark comprising five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before, showing that CuBERT outperforms them all, even with shorter training and fewer labeled examples.

Efficient Framework for Learning Code Representations through Semantic-Preserving Program Transformations
TLDR: This work proposes Corder, a self-supervised learning system that learns to represent code without labeled data, and shows that Corder pre-training improves code classification and method name prediction by large margins.

Unsupervised Translation of Programming Languages
TLDR: A fully unsupervised neural transcompiler is proposed that relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages.

Exploring Software Naturalness through Neural Language Models
TLDR: This work is the first to investigate whether transformer-based language models can discover AST features automatically, and introduces a sequence labeling task that directly probes the language models' understanding of ASTs.

Fast and Memory-Efficient Neural Code Completion
TLDR: A modular neural framework for code completion is presented, and a novel reranking completion model that combines static analysis with granular token encodings achieves 90% accuracy in its top five suggestions.

Neural Code Search Revisited: Enhancing Code Snippet Retrieval through Natural Language Intent
TLDR: A domain-specific retrieval model for code annotated with a natural language description yields significantly more relevant search results than state-of-the-art code retrieval methods that do not use descriptions (see the retrieval sketch after this list).

CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers for Analyzing Data Analysis
TLDR: This work proposes a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments, and finds that notebook composition correlates with the citation count of corresponding papers.

A Structural Transformer with Relative Positions in Trees for Code-to-Sequence Tasks (2020)
We suggest two extensions to incorporate syntactic information into transformer models operating on linearized trees (e.g. abstract syntax trees). First, we use self-attention with relative position…

Graph-based, Self-Supervised Program Repair from Diagnostic Feedback
TLDR: This work introduces a program-feedback graph, which connects symbols relevant to program repair in source code and diagnostic feedback, applies a graph neural network on top to model the reasoning process, and presents a self-supervised learning paradigm for program repair.

Compositional Generalization in Semantic Parsing: Pre-training vs. Specialized Architectures
TLDR: It is shown that masked language model (MLM) pre-training rivals SCAN-inspired architectures on primitive holdout splits, and a new state of the art is established on the CFQ compositional generalization benchmark using MLM pre-training together with an intermediate representation.
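Several of the citing papers above evaluate natural language code search, i.e. ranking code snippets against an NL query by embedding similarity. The sketch below shows that set-up with mean-pooled encoder embeddings and cosine similarity; the `microsoft/codebert-base` checkpoint, the pooling choice, and the toy `search` helper are illustrative assumptions rather than any single paper's method.

```python
# Retrieval sketch: rank code snippets for a natural-language query by
# cosine similarity of mean-pooled encoder embeddings. Checkpoint name
# and pooling are assumptions for illustration only.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
enc = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state       # (1, L, H)
    return hidden.mean(dim=1).squeeze(0)               # mean-pool to (H,)

def search(query: str, snippets: list[str], top_k: int = 3) -> list[str]:
    q = embed(query)
    scored = [(torch.cosine_similarity(q, embed(s), dim=0).item(), s)
              for s in snippets]
    return [s for _, s in sorted(scored, reverse=True)[:top_k]]

# Example (illustrative strings only):
# search("read a json file",
#        ["with open(p) as f: data = json.load(f)", "x = np.zeros(3)"])
```

A bi-encoder like this embeds queries and snippets independently, so snippet embeddings can be pre-computed and the ranking reduces to nearest-neighbour search.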

References

Showing 1-10 of 38 references
Pre-trained Contextual Embedding of Source Code
TLDR: This work curates a massive corpus of Python programs from GitHub to pre-train a BERT model, which is then evaluated on a joint classification, localization and repair task involving prediction of two pointers, and shows CuBERT's superiority when fine-tuned with smaller datasets and over fewer epochs.

code2seq: Generating Sequences from Structured Representations of Code
TLDR: This model represents a code snippet as the set of compositional paths in its abstract syntax tree and uses attention to select the relevant paths while decoding; it significantly outperforms previous models that were specifically designed for programming languages, as well as state-of-the-art NMT models.

Summarizing Source Code using a Neural Attention Model
TLDR: This paper presents the first completely data-driven approach for generating high-level summaries of source code, using Long Short-Term Memory (LSTM) networks with attention to produce sentences that describe C# code snippets and SQL queries.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR: This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks, and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

oLMpics-On What Language Model Pre-training Captures
TLDR: This work proposes eight reasoning tasks which conceptually require operations such as comparison, conjunction, and composition; the findings can help future work on designing new datasets, models, and objective functions for pre-training.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR: A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks (see the fine-tuning sketch after this list).

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
TLDR: BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
TLDR: The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute.

Improving Language Understanding by Generative Pre-Training
TLDR: The general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 of the 12 tasks studied.

Sequence to Sequence Learning with Neural Networks
TLDR: This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions about the sequence structure, and finds that reversing the order of the words in all source sentences markedly improved the LSTM's performance, because it introduced many short-term dependencies between the source and the target sentence that made the optimization problem easier.
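As the BERT entry above notes, these encoders are typically adapted to a downstream NL-PL task by adding a single output layer and fine-tuning. Below is a hedged sketch of that recipe for binary NL-code relevance classification (the code-search setting), assuming the `transformers` library and the publicly released `microsoft/codebert-base` checkpoint; the label convention and preprocessing are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: forward/backward pass for fine-tuning on NL-code relevance.
# Assumes the `transformers` library and the `microsoft/codebert-base`
# checkpoint; the 2-label head and query/code pairing are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)    # adds one untrained output layer

query = "return the maximum value in a list"
code = "def max_value(xs): return max(xs)"

# The tokenizer packs the pair in RoBERTa style: <s> query </s></s> code </s>
inputs = tok(query, code, return_tensors="pt", truncation=True, max_length=256)
labels = torch.tensor([1])                      # 1 = relevant, 0 = irrelevant (assumed convention)

outputs = model(**inputs, labels=labels)
outputs.loss.backward()                         # an optimizer step would follow in a real loop
print(outputs.logits.softmax(dim=-1))           # relevance probabilities
```

In practice this forward/backward pass would sit inside a training loop over labeled NL-code pairs; the snippet only shows a single example.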