CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

@article{Wang2021CodeT5IU,
  title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  author={Yue Wang and Weishi Wang and Shafiq R. Joty and Steven C. H. Hoi},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.00859}
}
Pre-trained models for Natural Languages (NL) like BERT and GPT have recently been shown to transfer well to Programming Languages (PL) and largely benefit a broad set of code-related tasks. Despite their success, most current methods either rely on an encoder-only (or decoder-only) pre-training that is suboptimal for generation (resp. understanding) tasks or process the code snippet in the same way as NL, neglecting the special characteristics of PL such as token types. We present CodeT5, a… 
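
The snippet below is a minimal sketch of how such an encoder-decoder checkpoint can be exercised for masked-span prediction via Hugging Face transformers. The checkpoint name Salesforce/codet5-base, the use of RobertaTokenizer with T5ForConditionalGeneration, and the sentinel-token convention are assumptions based on the publicly released model, not details stated on this page.

# Minimal sketch: masked-span infilling with a released CodeT5 checkpoint.
# Assumes the public "Salesforce/codet5-base" model on the Hugging Face Hub
# and the standard transformers API.
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# <extra_id_0> marks the span the encoder-decoder model should fill in.
code = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(code, return_tensors="pt").input_ids

# Generate a short completion for the masked span and decode it.
generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))

For downstream tasks such as code summarization or generation, the same model class would typically be fine-tuned on task-specific input/output pairs rather than used zero-shot as above.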

Citations

UniXcoder: Unified Cross-Modal Pre-training for Code Representation
TLDR
Results show that the model achieves state-of-the-art performance on most tasks, and analysis reveals that comments and ASTs can both enhance UniXcoder.
GypSum: Learning Hybrid Representations for Code Summarization
TLDR
GypSum is a new deep learning model that learns hybrid representations using graph attention neural networks and a pre-trained programming and natural language model and demonstrates the superior performance of GypSum over existing code summarization models.
Zero-Shot Program Representation Learning
TLDR
Zecoler is a zero-shot learning approach for code representations built upon a pre-trained programming language model that significantly outperforms baseline models in both zero-shot and few-shot settings.
StructCoder: Structure-Aware Transformer for Code Generation
TLDR
This work develops an encoder-decoder Transformer model where both the encoder and decoder are trained to recognize the syntax and data flow in the source and target codes, respectively, and achieves state-of-the-art performance on code translation and text-to-code generation tasks in the CodeXGLUE benchmark.
Cross-Domain Deep Code Search with Meta Learning
TLDR
Experimental results show that CDCS significantly outperforms conventional pre-trained code models that are directly fine-tuned in domain-specific languages, and it is particularly effective for scarce data.
Probing Pretrained Models of Source Code
TLDR
It is shown that pretrained models of code indeed contain information about code syntactic structure and correctness, the notion of namespaces, code readability and natural language naming, but lack understanding of code semantics.
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
TLDR
This work proposes CodeRL, a new framework for program synthesis that combines pretrained LMs with deep reinforcement learning (RL): it treats the code-generating LM as an actor network and introduces a critic network trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor.
CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation
TLDR
This paper investigates how to leverage an unlabelled code corpus to train a model for library-oriented code generation, and observes that library-oriented code snippets are more likely to share similar code sketches.
Understanding Long Programming Languages with Structure-Aware Sparse Attention
TLDR
This paper presents SASA, a Structure-Aware Sparse Attention mechanism that reduces complexity and improves performance on long code understanding tasks, and introduces AST structure into the attention mechanism.
NatGen: Generative pre-training by "Naturalizing" source code
TLDR
This paper proposes a new pre-training objective, “naturalizing” source code, which exploits code’s bimodal, dual-channel (formal and natural) nature: six classes of semantics-preserving transformations introduce unnatural forms of code, and the model is trained to produce the more natural original programs written by developers.
...

References

SHOWING 1-10 OF 40 REFERENCES
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
TLDR
This work develops CodeBERT with a Transformer-based neural architecture and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, i.e., detecting plausible alternatives sampled from generators.
Unified Pre-training for Program Understanding and Generation
TLDR
Analysis reveals that PLBART learns program syntax, style, and logical flow that are crucial to program semantics and thus excels even with limited annotations; it outperforms or rivals state-of-the-art models.
Multi-task Learning based Pre-trained Language Model for Code Completion
TLDR
A multi-task learning based pre-trained language model for code understanding and code generation, built on a Transformer-based neural architecture that jointly predicts each token and its type and utilizes the predicted type to assist token prediction.
MASS: Masked Sequence to Sequence Pre-training for Language Generation
TLDR
This work proposes MAsked Sequence to Sequence pre-training (MASS) for encoder-decoder based language generation tasks, which achieves state-of-the-art accuracy on unsupervised English-French translation, even beating the early attention-based supervised model.
Studying the Usage of Text-To-Text Transfer Transformer to Support Code-Related Tasks
TLDR
This paper empirically investigates how the T5 model performs when pre-trained and fine-tuned to support code-related tasks, and compares the performance of this single model with the results reported in the four original papers proposing DL-based solutions for those tasks.
Learning and Evaluating Contextual Embedding of Source Code
TLDR
This paper curates a massive, deduplicated corpus of 7.4M Python files from GitHub and creates an open-sourced benchmark comprising five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before, showing that CuBERT outperforms baseline models, even with shorter training and fewer labeled examples.
GraphCodeBERT: Pre-training Code Representations with Data Flow
TLDR
Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks; the model prefers structure-level attentions over token-level attentions in the code search task.
DOBF: A Deobfuscation Pre-Training Objective for Programming Languages
TLDR
A new pre-training objective, DOBF, is introduced that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code and shows that models pre-trained with DOBF outperform existing approaches on multiple downstream tasks.
Unified Language Model Pre-training for Natural Language Understanding and Generation
TLDR
A new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks, and compares favorably with BERT on the GLUE benchmark and the SQuAD 2.0 and CoQA question answering tasks.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
...