Contextualized Code Representation Learning for Commit Message Generation

Lun Yiu Nie, Cuiyun Gao, Zhicong Zhong, Wai Lam, Yang Liu, Zenglin Xu

A large-scale empirical study of commit message generation: models, datasets and evaluation

This paper conducts a systematic, in-depth analysis of state-of-the-art models and datasets for automatic commit message generation, and collects a large-scale, information-rich, multi-programming-language dataset, MCMD.

Dynamically Relative Position Encoding-Based Transformer for Automatic Code Edit

DTrans is designed with dynamically relative position encoding in the multi-head attention of Transformer; it generates patches more accurately than state-of-the-art methods and locates the lines to change with higher accuracy than existing methods.

FIRA: Fine-Grained Graph-Based Code Change Representation for Automated Commit Message Generation

FIRA is a novel commit message generation technique that first represents code changes via fine-grained graphs and then learns to generate commit messages automatically. It outperforms state-of-the-art techniques in terms of BLEU, ROUGE-L, and METEOR.

RACE: Retrieval-Augmented Commit Message Generation

This work proposes RACE, a new retrieval-augmented neural commit message generation method that treats the retrieved similar commit as an exemplar and leverages it to generate an accurate commit message.

What Makes a Good Commit Message?

This work develops a taxonomy based on recurring patterns in commit messages' expressions and investigates whether “good” commit messages can be automatically identified and whether such automation could prompt developers to write better commit messages.

Code Structure Guided Transformer for Source Code Summarization

This paper proposes a novel approach named SG-Trans to incorporate code structural properties into Transformer, which injects the local symbolic information and global syntactic structure into the self-attention module of Transformer as inductive bias to capture the hierarchical characteristics of code.

Jointly Learning to Repair Code and Generate Commit Message

This work proposes a joint model that can both repair the program code and generate the commit message in a unified framework and enhances the cascaded method with different training approaches, including the teacher-student method, the multi-task method, and the back-translation method.

Disentangled Code Representation Learning for Multiple Programming Languages

The experimental results validate the superiority of the proposed disentangled code representation learning approach, compared to several baselines, across three types of downstream tasks, i.e., code clone detection, code translation, and code-to-code search.

CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model

This work develops a model that automatically writes commit messages, and releases a dataset of 345K pairs of code modifications and commit messages in six programming languages.



Generating Commit Messages from Diffs using Pointer-Generator Network

PtrGNCMsg is a novel approach that uses an improved sequence-to-sequence model with a pointer-generator network to translate code diffs into commit messages. It outperforms recent approaches based on neural machine translation and is the first to enable the prediction of out-of-vocabulary (OOV) words.
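The way a pointer-generator network can emit OOV words may be easier to see in a small sketch. The function below mixes a generation distribution over a fixed vocabulary with a copy distribution given by attention over the source tokens. The function name, the toy vocabulary, and all probabilities are illustrative assumptions, not values or code from PtrGNCMsg.

```python
def pointer_generator_mix(p_gen, vocab_dist, attention, source_tokens):
    """Combine generating from the vocabulary with copying from the source.

    p_gen         : probability of generating rather than copying (0..1)
    vocab_dist    : dict mapping vocabulary word -> generation probability
    attention     : attention weights over source_tokens (sum to 1)
    source_tokens : tokens of the input diff (hypothetical example data)
    """
    # Scale the vocabulary distribution by the generation probability.
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Add copy probability mass to every source token, in or out of vocabulary.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


# Toy example: "parseConfig" is not in the vocabulary but appears in the diff.
vocab_dist = {"fix": 0.6, "bug": 0.3, "<unk>": 0.1}
source_tokens = ["fix", "parseConfig", "bug"]
attention = [0.2, 0.7, 0.1]
final = pointer_generator_mix(0.5, vocab_dist, attention, source_tokens)
```

Because the copy term assigns probability to every source token, an identifier such as `parseConfig` receives non-zero probability even though the vocabulary alone could only produce `<unk>`.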

Neural-Machine-Translation-Based Commit Message Generation: How Far Are We?

A simpler and faster approach is proposed, named NNGen (Nearest Neighbor Generator), to generate concise commit messages using the nearest neighbor algorithm, which is over 2,600 times faster than NMT, and outperforms NMT in terms of BLEU by 21%.
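A minimal sketch of the nearest-neighbor idea: find the training diff most similar to the new diff and reuse its commit message. This version assumes whitespace tokenization and cosine similarity over token counts; the function names and example data are hypothetical, and any further ranking steps used by NNGen are omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbor_message(new_diff, training_pairs):
    """Return the message of the training diff most similar to new_diff."""
    query = Counter(new_diff.split())
    _, best_message = max(
        training_pairs,
        key=lambda pair: cosine(query, Counter(pair[0].split())),
    )
    return best_message


# Hypothetical training data: (diff, commit message) pairs.
training_pairs = [
    ("- int x = 0 ; + int x = 1 ;", "Change default value of x"),
    ("+ import java.util.List ;", "Add missing import"),
]
message = nearest_neighbor_message("+ import java.util.Map ;", training_pairs)
```

No model is trained here; the only work at prediction time is a similarity search, which is consistent with the large speed advantage over NMT reported above.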

Automatically generating commit messages from diffs using neural machine translation

This paper adapts Neural Machine Translation (NMT) to automatically "translate" diffs into commit messages, and designs a quality-assurance filter that detects cases in which the algorithm is unable to produce good messages and returns a warning instead.

A Transformer-based Approach for Source Code Summarization

This work explores the Transformer model, which uses a self-attention mechanism and has been shown to be effective in capturing long-range dependencies, for source code summarization, and shows that although the approach is simple, it outperforms state-of-the-art techniques by a significant margin.

SCELMo: Source Code Embeddings from Language Models

It is shown that even a low-dimensional embedding trained on a relatively small corpus of programs can improve a state-of-the-art machine learning system for bug detection.

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

This work develops CodeBERT with Transformer-based neural architecture, and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators.

Incorporating BERT into Neural Machine Translation

A new algorithm named BERT-fused model is proposed, in which BERT is first used to extract representations for an input sequence, and then the representations are fused with each layer of the encoder and decoder of the NMT model through attention mechanisms.

Commit Message Generation for Source Code Changes

This paper first extracts both code structure and code semantics from the source code changes, and then jointly models these two sources of information to better learn the representations of the code changes.

A Novel Neural Source Code Representation Based on Abstract Syntax Tree

This paper proposes a novel AST-based Neural Network (ASTNN) for source code representation that splits each large AST into a sequence of small statement trees, and encodes the statement trees to vectors by capturing the lexical and syntactical knowledge of statements.
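The idea of turning one large AST into a sequence of statement-level units can be sketched with Python's standard `ast` module. This is a simplified illustration, not ASTNN itself: using Python as the target language is an assumption, the function name is hypothetical, and the sketch records only the type of each statement node in preorder rather than the full statement trees that would be encoded into vectors.

```python
import ast


def statement_sequence(source):
    """Return the statement node types of `source` in preorder."""
    tree = ast.parse(source)
    statements = []

    def visit(node):
        # Every statement node starts a new unit in the sequence.
        if isinstance(node, ast.stmt):
            statements.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(tree)
    return statements


code = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
sequence = statement_sequence(code)
```

For the small function above, the sequence contains one entry for the function definition, one for the `if`, and one for each `return`, so nested statements become separate units instead of one deep tree.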

MASS: Masked Sequence to Sequence Pre-training for Language Generation

This work proposes MAsked Sequence to Sequence pre-training (MASS) for the encoder-decoder based language generation tasks, which achieves the state-of-the-art accuracy on the unsupervised English-French translation, even beating the early attention-based supervised model.
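The masked sequence-to-sequence objective can be illustrated by how a single training example is built: a contiguous span of the input is replaced by mask tokens on the encoder side, and the decoder is trained to predict that span. The function name, the `[MASK]` symbol, and the example sentence below are assumptions for illustration only.

```python
def mass_example(tokens, start, length, mask="[MASK]"):
    """Build one masked sequence-to-sequence training pair.

    Returns (encoder_input, decoder_target), where the span
    tokens[start:start + length] is masked in the encoder input
    and becomes the sequence the decoder must predict.
    """
    end = start + length
    encoder_input = tokens[:start] + [mask] * length + tokens[end:]
    decoder_target = tokens[start:end]
    return encoder_input, decoder_target


tokens = ["fix", "null", "pointer", "in", "parser"]
encoder_input, decoder_target = mass_example(tokens, start=1, length=2)
```

Here the encoder sees the sentence with two positions masked, and the decoder target is the two missing tokens, so the encoder and decoder are pre-trained jointly.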