UniXcoder: Unified Cross-Modal Pre-training for Code Representation

Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, Jian Yin
Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such an encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion, which requires a decoder-only manner for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for…
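UniXcoder switches between encoder-only, decoder-only, and encoder-decoder behavior through self-attention masks selected by a mode prefix. The following is a minimal illustrative sketch of such mode-dependent masks, not the released implementation; the function name and the flat-list representation are my own:

```python
def attention_mask(mode, length, source_len=0):
    """Build a binary self-attention mask for one sequence.

    mask[i][j] == 1 means position i may attend to position j.
    mode: "encoder" (fully bidirectional), "decoder" (causal), or
    "enc-dec" (the first `source_len` tokens attend bidirectionally
    among themselves; the remaining target tokens attend causally
    to everything before them, including the source).
    """
    mask = [[0] * length for _ in range(length)]
    for i in range(length):
        for j in range(length):
            if mode == "encoder":
                mask[i][j] = 1
            elif mode == "decoder":
                mask[i][j] = 1 if j <= i else 0
            elif mode == "enc-dec":
                if i < source_len:
                    mask[i][j] = 1 if j < source_len else 0
                else:
                    mask[i][j] = 1 if j <= i else 0
    return mask
```

Sharing one Transformer across all three masks is what lets a single model serve understanding tasks (encoder mode) and efficient completion (decoder mode).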
ReACC: A Retrieval-Augmented Code Completion Framework
This work proposes a retrieval-augmented code completion framework that leverages both lexical copying and retrieval of semantically similar code, and adopts a stage-wise training approach combining a source code retriever with an auto-regressive language model for programming languages.
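The two stages (retrieve similar code, then condition the completion model on it) can be sketched with a purely lexical retriever. Every name below is a hypothetical stand-in: ReACC's actual retriever combines sparse and trained dense components, and the language-model call is omitted here.

```python
import re


def tokenize(code):
    # crude lexical tokenizer: split on non-identifier characters
    return set(t for t in re.split(r"\W+", code) if t)


def jaccard(a, b):
    # lexical overlap between two token sets
    return len(a & b) / len(a | b) if a | b else 0.0


def retrieve(partial_code, corpus):
    """Return the corpus snippet most lexically similar to the
    unfinished code, to be prepended to the completion prompt."""
    query = tokenize(partial_code)
    return max(corpus, key=lambda snippet: jaccard(query, tokenize(snippet)))


def build_prompt(partial_code, corpus):
    # stage 2 would hand this prompt to an auto-regressive LM,
    # which can copy lexically from the retrieved snippet
    return retrieve(partial_code, corpus) + "\n# ---\n" + partial_code
```

The design point is that the retriever and the completion model are trained stage-wise rather than end-to-end, so either component can be swapped independently.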
PanGu-Coder: Program Synthesis with Function-Level Language Modeling
This work presents a pretrained decoder-only language model adopting the PanGu-α architecture for text-to-code generation, i.e., the synthesis of programming language solutions given a natural language problem description.
CodeReviewer: Pre-Training for Automating Code Review Activities
This research proposes CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review scenario, and establishes a high-quality benchmark dataset based on the collected data for three code review tasks.
Addressing Leakage in Self-Supervised Contextualized Code Retrieval
This work addresses contextualized code retrieval, the search for code snippets that help fill gaps in a partial input program, and proposes a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets to combat leakage; the approach also transfers to code clone and defect detection.
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL.
SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation
This paper proposes SynCoBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations, and designs two novel pre-training objectives originating from the symbolic and syntactic properties of source code, i.e., Identifier Prediction (IP) and AST Edge Prediction (TEP).
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
This work develops CodeBERT with a Transformer-based neural architecture and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, in which the model learns to detect plausible alternative tokens sampled from generators.
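Replaced token detection builds its training signal by letting a small generator substitute some tokens and asking the main model to label each position as original or replaced. A hedged sketch of the data construction only; the `generator` argument is a mock, whereas CodeBERT samples substitutes from trained n-gram/MLM generators:

```python
import random


def replaced_token_detection_example(tokens, generator, replace_prob=0.15, seed=0):
    """Corrupt a token sequence with plausible alternatives and emit
    per-token binary labels (1 = replaced) for the discriminator."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            sub = generator(tok)
            corrupted.append(sub)
            # if the generator happens to regenerate the original
            # token, the position still counts as original
            labels.append(0 if sub == tok else 1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels
```

Unlike masked language modeling, every position contributes to the loss, which is what makes the objective sample-efficient.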
Multi-task Learning based Pre-trained Language Model for Code Completion
A multi-task learning based pre-trained language model for code understanding and code generation, built on a Transformer-based neural architecture, that predicts each token and its type jointly and utilizes the predicted type to assist token prediction.
GraphCodeBERT: Pre-training Code Representations with Data Flow
Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks; the model is also shown to prefer structure-level attention over token-level attention in the task of code search.
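The data flow GraphCodeBERT consumes links each variable use to the definition its value comes from. A rough sketch for straight-line Python using the standard `ast` module; the function name and the simplifications (no branches, loops, or scopes) are my own, not the paper's extraction pipeline:

```python
import ast


def data_flow_edges(source):
    """Rough 'where does this value come from' edges: each variable
    read is linked to the most recent write of the same name.
    Handles straight-line, top-level code only."""
    tree = ast.parse(source)
    last_def = {}   # name -> (name, lineno) of the latest write
    edges = []
    for stmt in tree.body:
        # process reads before recording writes, so `x = x + 1`
        # links the read of x to its previous definition
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                edges.append(((n.id, n.lineno), last_def[n.id]))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = (n.id, n.lineno)
    return edges
```

Edges like these, rather than the AST itself, are what the paper feeds into pre-training, on the grounds that data flow is a more compact, semantics-level structure.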
Pre-trained Contextual Embedding of Source Code
This work curates a massive corpus of Python programs from GitHub to pre-train a BERT model, which is then evaluated on a joint classification, localization, and repair task involving prediction of two pointers, and shows CuBERT's superiority when fine-tuned with smaller datasets and over fewer epochs.
Unified Pre-training for Program Understanding and Generation
Analysis reveals that PLBART learns program syntax, style, and logical flow, which are crucial to program semantics; it thus excels even with limited annotations, and outperforms or rivals state-of-the-art models.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
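BERT's masked-language-model objective selects a fraction of positions as prediction targets and corrupts them with the well-known 80/10/10 rule. A small illustrative implementation; sampling the random replacement from the in-sequence `vocab` is a simplification, since real BERT draws from its full WordPiece vocabulary:

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: ~15% of positions become prediction
    targets; of those, 80% are replaced by [MASK], 10% by a random
    token, and 10% are kept unchanged. Returns (inputs, labels),
    where labels holds the original token at target positions and
    None elsewhere."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels
```

Keeping 10% of targets unchanged is what forces the model to maintain a contextual representation for every token, not just the visibly masked ones.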
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks, achieving state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
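T5 casts every task to text-to-text and pre-trains with span corruption: dropped-out spans are replaced by sentinel tokens in the input, and the target reconstructs them. A sketch with hand-picked spans; T5 itself samples span positions and lengths randomly:

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption. Each (start, end) span of the input
    is replaced by a sentinel token; the target lists the dropped
    spans, each introduced by its sentinel, closed by a final
    sentinel. `spans` must be non-overlapping and in order."""
    inputs, targets = [], []
    prev = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[prev:start])   # keep text up to the span
        inputs.append(sentinel)             # stand-in for the span
        targets.append(sentinel)
        targets.extend(tokens[start:end])   # dropped content to predict
        prev = end
    inputs.extend(tokens[prev:])
    targets.append(f"<extra_id_{len(spans)}>")
    return inputs, targets
```

With the sentence "Thank you for inviting me to your party last week" and spans covering "you for" and "your", this reproduces the corruption example illustrated in the paper.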
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
BART is presented, a denoising autoencoder for pretraining sequence-to-sequence models, which matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks.