UniXcoder: Unified Cross-Modal Pre-training for Code Representation

@inproceedings{Guo2022UniXcoderUC,
  title={UniXcoder: Unified Cross-Modal Pre-training for Code Representation},
  author={Daya Guo and Shuai Lu and Nan Duan and Yanlin Wang and Ming Zhou and Jian Yin},
  booktitle={ACL},
  year={2022}
}
Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such an encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion, which requires a decoder-only manner for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for…
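The full paper realizes this by letting a single Transformer act as an encoder, a decoder, or an encoder-decoder, with mode prefixes ([enc], [dec], [enc-dec]) selecting the corresponding self-attention mask. The sketch below is an illustrative reconstruction of those three mask shapes, not the authors' code.

```python
# Minimal sketch of how one Transformer can behave as an encoder, a decoder, or an
# encoder-decoder purely through its self-attention mask, the mechanism UniXcoder's
# mode prefixes control. Illustrative only, not the released implementation.
import torch

def attention_mask(mode: str, src_len: int, tgt_len: int = 0) -> torch.Tensor:
    """Return an (L, L) boolean mask; True means position i may attend to position j."""
    L = src_len + tgt_len
    if mode == "enc":            # encoder-only: full bidirectional attention
        return torch.ones(L, L, dtype=torch.bool)
    if mode == "dec":            # decoder-only: causal (lower-triangular) attention
        return torch.tril(torch.ones(L, L, dtype=torch.bool))
    if mode == "enc-dec":        # prefix LM: bidirectional over the source, causal over the target
        mask = torch.tril(torch.ones(L, L, dtype=torch.bool))
        mask[:, :src_len] = True # source tokens are visible to every position
        return mask
    raise ValueError(mode)

print(attention_mask("enc-dec", src_len=3, tgt_len=2).int())
```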

Citations

ReACC: A Retrieval-Augmented Code Completion Framework
TLDR
This work proposes a retrieval-augmented code completion framework that leverages both lexical copying and reference to semantically similar code obtained by retrieval, and adopts a stage-wise training approach combining a source code retriever with an auto-regressive language model for programming languages.
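As a rough illustration of the retrieval-augmented setup described above, the sketch below retrieves the most lexically similar snippet from a codebase and prepends it to the completion prompt. ReACC's actual retriever is a hybrid of lexical and dense (semantic) retrieval; the Jaccard scorer and the `generate_completion` call here are simplified, hypothetical stand-ins.

```python
# Hedged sketch of retrieval-augmented completion: fetch code similar to the unfinished
# context and prepend it to the prompt of an auto-regressive LM. `generate_completion`
# is a hypothetical LM call, and token-overlap Jaccard is a deliberately simple retriever.
import re

def tokens(code: str) -> set[str]:
    return set(re.findall(r"[A-Za-z_]\w*", code))

def retrieve(context: str, codebase: list[str]) -> str:
    """Return the snippet with the highest token-overlap (Jaccard) similarity."""
    ctx = tokens(context)
    def score(snippet: str) -> float:
        s = tokens(snippet)
        return len(ctx & s) / max(len(ctx | s), 1)
    return max(codebase, key=score)

def build_prompt(context: str, codebase: list[str]) -> str:
    similar = retrieve(context, codebase)
    # The retrieved code is supplied as extra context; the LM then completes `context`.
    return f"# Retrieved reference:\n{similar}\n\n{context}"

# completion = generate_completion(build_prompt(unfinished_code, repo_snippets))  # hypothetical LM call
```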
CodeReviewer: Pre-Training for Automating Code Review Activities
TLDR
This research proposes CodeReviewer, a pre-trained model that utilizes four pre-training tasks tailored specifically for the code review scenario, and establishes a high-quality benchmark dataset, based on the collected data, for three downstream code review tasks.
Addressing Leakage in Self-Supervised Contextualized Code Retrieval
TLDR
This work addresses contextualized code retrieval, the search for code snippets helpful for filling gaps in a partial input program, and suggests a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets to combat leakage during self-supervised training.
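A rough, illustrative take on the mutual identifier masking idea named above (not the authors' implementation): identifiers shared by the partial program and the candidate snippet are hidden in both, so retrieval cannot succeed through trivial name matching.

```python
# Sketch of mutual identifier masking: names occurring in BOTH the query context and the
# target snippet are replaced by placeholders in both views. Regex tokenization and the
# keyword list are simplifying assumptions.
import re

IDENT = re.compile(r"\b[A-Za-z_]\w*\b")
KEYWORDS = {"def", "return", "if", "else", "for", "while", "import", "class", "in", "not"}

def identifiers(code: str) -> set[str]:
    return {m.group() for m in IDENT.finditer(code)} - KEYWORDS

def mutually_mask(query: str, target: str, mask_token: str = "<mask>") -> tuple[str, str]:
    shared = identifiers(query) & identifiers(target)
    def mask(code: str) -> str:
        return IDENT.sub(lambda m: mask_token if m.group() in shared else m.group(), code)
    return mask(query), mask(target)

q, t = mutually_mask("total = add(price, tax)", "def add(price, tax):\n    return price + tax")
print(q)  # total = <mask>(<mask>, <mask>)   (shared names are hidden in both views)
```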

References

Showing 1-10 of 36 references
CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation
TLDR
Comprehensive experiments show that CodeT5 significantly outperforms prior methods on understanding tasks such as code defect detection and clone detection, and generation tasks across various directions including PL-NL, NL-PL, and PL-PL.
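One of CodeT5's identifier-aware objectives is masked identifier prediction, in which identifiers are replaced by sentinels and the decoder recovers their names. The sketch below approximates that input/target construction; the regex tokenizer and sentinel format are assumptions, not CodeT5's exact preprocessing.

```python
# Hedged sketch of masked identifier prediction: every occurrence of each unique
# identifier maps to the same sentinel, and the target lists the names to recover.
import re

IDENT = re.compile(r"\b[A-Za-z_]\w*\b")
KEYWORDS = {"def", "return", "if", "else", "for", "while", "import", "class", "in"}

def mask_identifiers(code: str) -> tuple[str, str]:
    sentinels: dict[str, str] = {}
    def repl(m: re.Match) -> str:
        name = m.group()
        if name in KEYWORDS:
            return name
        if name not in sentinels:
            sentinels[name] = f"<extra_id_{len(sentinels)}>"
        return sentinels[name]
    source = IDENT.sub(repl, code)
    target = " ".join(f"{tok} {name}" for name, tok in sentinels.items())
    return source, target

src, tgt = mask_identifiers("def area(w, h): return w * h")
print(src)  # def <extra_id_0>(<extra_id_1>, <extra_id_2>): return <extra_id_1> * <extra_id_2>
print(tgt)  # <extra_id_0> area <extra_id_1> w <extra_id_2> h
```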
SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation
TLDR
This paper proposes SYNCOBERT, a syntax-guided multi-modal contrastive pre-training approach for better code representations, and designs two novel pre-training objectives originating from the symbolic and syntactic properties of source code, i.e., Identifier Prediction (IP) and AST Edge Prediction (TEP).
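The contrastive component of such multi-modal pre-training can be illustrated with a standard InfoNCE loss between embeddings of two views of the same sample (e.g., the code and its AST or comment view). The encoder, batch size, and temperature below are placeholders, not SynCoBERT's settings.

```python
# Minimal InfoNCE-style contrastive loss over two views of the same samples:
# matched pairs sit on the diagonal of the similarity matrix and act as positives,
# all other pairs in the batch act as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(view_a: torch.Tensor, view_b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """view_a, view_b: (batch, dim) embeddings; row i of each forms a positive pair."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.T / temperature               # (batch, batch) similarity matrix
    labels = torch.arange(a.size(0))             # positives on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```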
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
TLDR
This work develops CodeBERT with a Transformer-based neural architecture and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, in which the model detects plausible alternative tokens sampled from generators.
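A minimal sketch of the replaced token detection objective: corrupt some positions with plausible alternatives and label every position as original or replaced. CodeBERT samples replacements from learned generators; random sampling from a tiny vocabulary here is only a stand-in.

```python
# Sketch of replaced token detection (RTD) data construction: 1 = replaced, 0 = original.
# (A full RTD setup would relabel an accidental identical replacement as original.)
import random

def corrupt(tokens: list[str], vocab: list[str], replace_prob: float = 0.15, seed: int = 0):
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice(vocab))   # plausible alternative from a "generator"
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

toks = "def add ( a , b ) : return a + b".split()
corrupted, labels = corrupt(toks, vocab=["x", "sub", "0", "-"])
print(list(zip(corrupted, labels)))
```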
Multi-task Learning based Pre-trained Language Model for Code Completion
TLDR
A multi-task learning based pre-trained language model for code understanding and code generation, built on a Transformer-based neural architecture, that adopts multi-task learning to jointly predict each token and its type, and utilizes the predicted type to assist token prediction.
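A sketch of the joint token-and-type prediction described above, assuming a generic Transformer backbone: two linear heads share the same hidden states and their cross-entropy losses are summed. Dimensions and the omitted type-feedback mechanism are illustrative simplifications.

```python
# Two prediction heads over shared hidden states; the multi-task loss is the sum of
# the next-token and token-type cross-entropies.
import torch
import torch.nn as nn

class JointHead(nn.Module):
    def __init__(self, hidden: int, vocab_size: int, num_types: int):
        super().__init__()
        self.token_head = nn.Linear(hidden, vocab_size)
        self.type_head = nn.Linear(hidden, num_types)

    def forward(self, hidden_states, token_targets, type_targets):
        token_logits = self.token_head(hidden_states)   # (batch, seq, vocab)
        type_logits = self.type_head(hidden_states)     # (batch, seq, num_types)
        ce = nn.CrossEntropyLoss()
        return ce(token_logits.flatten(0, 1), token_targets.flatten()) \
             + ce(type_logits.flatten(0, 1), type_targets.flatten())

head = JointHead(hidden=256, vocab_size=1000, num_types=20)
h = torch.randn(2, 16, 256)
loss = head(h, torch.randint(0, 1000, (2, 16)), torch.randint(0, 20, (2, 16)))
```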
GraphCodeBERT: Pre-training Code Representations with Data Flow
TLDR
Results show that code structure and the newly introduced pre-training tasks improve GraphCodeBERT, which achieves state-of-the-art performance on the four downstream tasks; the model is also shown to prefer structure-level attention over token-level attention in the code search task.
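GraphCodeBERT's structure comes from a variable-level data flow graph ("where a value comes from"). The sketch below extracts such def-use edges for straight-line Python assignments using the standard ast module; it is a toy approximation of the parser-based extraction used in the paper.

```python
# Toy data-flow extraction for straight-line assignments: each use of a variable gets
# an edge back to the line that last assigned it.
import ast

def data_flow_edges(code: str) -> list[tuple[str, int, int]]:
    """Return (variable, def_line, use_line) edges for simple assignments."""
    last_def: dict[str, int] = {}
    edges = []
    for node in ast.parse(code).body:
        if isinstance(node, ast.Assign):
            for name in ast.walk(node.value):            # uses on the right-hand side
                if isinstance(name, ast.Name) and name.id in last_def:
                    edges.append((name.id, last_def[name.id], node.lineno))
            for tgt in node.targets:                     # definitions on the left-hand side
                if isinstance(tgt, ast.Name):
                    last_def[tgt.id] = node.lineno
    return edges

print(data_flow_edges("a = 1\nb = a + 2\nc = a + b"))
# [('a', 1, 2), ('a', 1, 3), ('b', 2, 3)]
```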
Pre-trained Contextual Embedding of Source Code
TLDR
This work curates a massive corpus of Python programs from GitHub to pre-train a BERT model, which is then evaluated on a joint classification, localization, and repair task involving the prediction of two pointers, and shows CuBERT's superiority when fine-tuned with smaller datasets and for fewer epochs.
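The joint localization-and-repair evaluation mentioned above predicts two pointers over token positions. The sketch below shows one plausible two-pointer head on top of an encoder's hidden states; it is an assumption for illustration, not the released CuBERT code.

```python
# Two pointer heads: one distribution over positions for the faulty variable use,
# one for the position supplying the repair. The backbone encoder is a placeholder.
import torch
import torch.nn as nn

class TwoPointerHead(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.locate = nn.Linear(hidden, 1)   # which position is wrong?
        self.repair = nn.Linear(hidden, 1)   # which position supplies the fix?

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq, hidden) from any pre-trained encoder
        loc_logits = self.locate(hidden_states).squeeze(-1)   # (batch, seq)
        rep_logits = self.repair(hidden_states).squeeze(-1)   # (batch, seq)
        return loc_logits.softmax(-1), rep_logits.softmax(-1)

head = TwoPointerHead(hidden=256)
loc, rep = head(torch.randn(2, 128, 256))
```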
Unified Pre-training for Program Understanding and Generation
TLDR
Analysis reveals that PLBART learns program syntax, style, and logical flow that are crucial to program semantics and thus excels even with limited annotations, and outperforms or rivals state-of-the-art models.
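PLBART is pre-trained as a denoising autoencoder: the input is corrupted and the seq2seq model reconstructs the original. The sketch below uses simple span masking as the noise function; the exact noising recipe in the paper (mask rate, span distribution, token deletion) differs.

```python
# Denoising setup in miniature: mask short random spans in the input and keep the
# original sequence as the reconstruction target. Rates and span lengths are illustrative.
import random

def noise(tokens: list[str], mask_token: str = "<mask>", mask_prob: float = 0.35, seed: int = 0):
    rng = random.Random(seed)
    noisy, i = [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            span = rng.randint(1, 3)      # replace a short span with a single mask token
            noisy.append(mask_token)
            i += span
        else:
            noisy.append(tokens[i])
            i += 1
    return noisy, tokens                   # (corrupted input, reconstruction target)

src, tgt = noise("public int add ( int a , int b ) { return a + b ; }".split())
print(" ".join(src))
```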
Unified Language Model Pre-training for Natural Language Understanding and Generation
TLDR
A new Unified pre-trained Language Model (UniLM) that can be fine-tuned for both natural language understanding and generation tasks; it compares favorably with BERT on the GLUE benchmark and on the SQuAD 2.0 and CoQA question answering tasks.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
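The bidirectional conditioning above is trained with masked language modeling. The sketch below shows the standard 15% selection with the 80/10/10 mask/random/keep split, using -100 as the conventional ignore-index for unselected positions.

```python
# Standard masked-language-modeling data construction in miniature.
import random

def mlm_mask(token_ids: list[int], vocab_size: int, mask_id: int, seed: int = 0):
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < 0.15:                         # select 15% of positions
            labels[i] = tok                             # the model must recover the original id
            r = rng.random()
            if r < 0.8:
                inputs[i] = mask_id                     # 80%: [MASK]
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)   # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels

print(mlm_mask([12, 7, 99, 4, 53, 8], vocab_size=1000, mask_id=103))
```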
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
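In the text-to-text framing above, pre-training uses span corruption: contiguous spans are replaced by sentinel tokens and the target spells out the dropped text. The sketch below is a simplified version of that construction; span selection and rates are illustrative, not T5's exact procedure.

```python
# Simplified span corruption: drop short spans, mark each with a sentinel, and build a
# target that lists every sentinel followed by the text it replaced.
import random

def span_corrupt(tokens: list[str], corrupt_rate: float = 0.15, seed: int = 0):
    rng = random.Random(seed)
    inputs, targets, i, sid = [], [], 0, 0
    while i < len(tokens):
        if rng.random() < corrupt_rate:
            span = rng.randint(1, 3)
            sentinel = f"<extra_id_{sid}>"; sid += 1
            inputs.append(sentinel)
            targets.append(sentinel); targets.extend(tokens[i:i + span])
            i += span
        else:
            inputs.append(tokens[i]); i += 1
    return " ".join(inputs), " ".join(targets)

src, tgt = span_corrupt("translate English to German : That is good .".split())
print(src); print(tgt)
```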