Ensemble Models for Neural Source Code Summarization of Subroutines

@inproceedings{LeClair2021EnsembleMF,
  title={Ensemble Models for Neural Source Code Summarization of Subroutines},
  author={Alexander LeClair and Aakash Bansal and Collin McMillan},
  booktitle={2021 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
  year={2021},
  pages={286-297}
}
A source code summary of a subroutine is a brief description of that subroutine. Summaries underpin a majority of documentation consumed by programmers, such as the method summaries in JavaDocs. Source code summarization is the task of writing these summaries. At present, most state-of-the-art approaches for code summarization are neural network-based solutions akin to seq2seq, graph2seq, and other encoder-decoder architectures. The input to the encoder is source code, while the decoder helps… 
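
To make the encoder-decoder setup concrete, below is a minimal sketch of a seq2seq-style code summarizer with attention, in the spirit of the architectures the abstract names (seq2seq, graph2seq, and related encoder-decoder models). It is a generic illustration rather than the paper's model, and the GRU encoder, dot-product attention, vocabulary sizes, and dimensions are all assumptions made for this example.

# Minimal sketch (not the paper's model): an encoder reads tokenized source code,
# a decoder generates summary words, and attention lets the decoder focus on
# relevant code tokens. All sizes and token ids below are illustrative assumptions.
import torch
import torch.nn as nn

class CodeSummarizer(nn.Module):
    def __init__(self, code_vocab=10000, text_vocab=5000, emb=256, hid=256):
        super().__init__()
        self.code_emb = nn.Embedding(code_vocab, emb)
        self.text_emb = nn.Embedding(text_vocab, emb)
        self.encoder = nn.GRU(emb, hid, batch_first=True)  # encodes source-code tokens
        self.decoder = nn.GRU(emb, hid, batch_first=True)  # generates summary tokens
        self.out = nn.Linear(hid * 2, text_vocab)

    def forward(self, code_ids, summary_ids):
        enc_states, enc_last = self.encoder(self.code_emb(code_ids))
        dec_states, _ = self.decoder(self.text_emb(summary_ids), enc_last)
        # Dot-product attention over the encoder states at every decoder step.
        scores = torch.bmm(dec_states, enc_states.transpose(1, 2))
        context = torch.bmm(torch.softmax(scores, dim=-1), enc_states)
        return self.out(torch.cat([dec_states, context], dim=-1))  # logits per summary position

# Toy usage: a batch of 2 subroutines (20 code tokens each) and 8-token summary prefixes.
model = CodeSummarizer()
code_ids = torch.randint(0, 10000, (2, 20))
summary_ids = torch.randint(0, 5000, (2, 8))
print(model(code_ids, summary_ids).shape)  # torch.Size([2, 8, 5000])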

Citations of this paper

Meta Learning for Code Summarization

TLDR
This paper shows that three state-of-the-art models for code summarization work well on largely disjoint subsets of a large code base, and proposes three meta-models that select the best candidate summary for a given code segment.

On the Evaluation of Neural Code Summarization

TLDR
A systematic and in-depth analysis of 5 state-of-the-art neural code summarization models across 6 widely used BLEU variants, 4 pre-processing operations and their combinations, and 3 widely used datasets shows that these factors have a substantial influence on evaluation, affecting both the measured performance of the models and their relative ranking.

HELoC: Hierarchical Contrastive Learning of Source Code Representation

  • Xiao Wang, Qiong Wu, Songlin Hu
  • Computer Science
    2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC)
  • 2022
TLDR
HELoC is a hierarchical contrastive learning model for source code representation that pushes the representation vectors of nodes at more distant AST levels farther apart in the embedding space, so that structural similarities between code snippets can be measured more precisely.

CoditT5: Pretraining for Source Code and Natural Language Editing

TLDR
A novel pretraining objective is proposed that explicitly models edits and is used to build CoditT5, a large language model for software-related editing tasks that is pretrained on large amounts of source code and natural language comments.

Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization

TLDR
An automated code-comment cleaning tool is proposed that can accurately detect noisy data caused by inappropriate data preprocessing operations in existing benchmark datasets, and removing this noisy data leads to a significant improvement in the performance of code summarization.

Multilingual training for Software Engineering

TLDR
Evidence is presented suggesting that human-written code in different languages, performing the same function, is rather similar and in particular preserves identifier naming patterns, and that available multilingual training data (across different languages) can be used to amplify performance.

Automatic Comment Generation via Multi-Pass Deliberation

TLDR
The proposed DECOM is a multi-pass deliberation framework for automatic comment generation that outperforms the state-of-the-art baselines, and a human evaluation study confirms that the comments generated by DECOM tend to be more readable, informative, and useful.

A Survey on Machine Learning Techniques for Source Code Analysis

References

Showing 1-10 of 41 references

Project-Level Encoding for Neural Source Code Summarization of Subroutines

TLDR
This paper presents a project-level encoder to improve models of code summarization, demonstrates how the encoder improves several existing models, and provides guidelines for maximizing the improvement while controlling the time and resource costs of model size.

Summarizing Source Code using a Neural Attention Model

TLDR
This paper presents the first completely data-driven approach for generating high-level summaries of source code, which uses Long Short-Term Memory (LSTM) networks with attention to produce sentences that describe C# code snippets and SQL queries.

A Transformer-based Approach for Source Code Summarization

TLDR
This work explores the Transformer model, which uses a self-attention mechanism and has been shown to be effective at capturing long-range dependencies, for source code summarization, and shows that despite its simplicity the approach outperforms the state-of-the-art techniques by a significant margin.

Improved Code Summarization via a Graph Neural Network

TLDR
This paper presents an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate source code summaries, and shows improvement over four baseline techniques.

A Neural Model for Generating Natural Language Summaries of Program Subroutines

TLDR
This paper presents a neural model that combines words from code with code structure from an AST, which allows the model to learn code structure independent of the text in code.

Retrieval-Augmented Generation for Code Summarization via Hybrid GNN

TLDR
A novel retrieval-augmented mechanism is proposed to combine the benefits of both worlds in source code summarization; it introduces an attention-based dynamic graph to complement the static graph representation of the source code and designs a hybrid message-passing GNN to capture both local and global structural information.

code2seq: Generating Sequences from Structured Representations of Code

TLDR
This model represents a code snippet as the set of compositional paths in its abstract syntax tree and uses attention to select the relevant paths while decoding, and it significantly outperforms previous models that were specifically designed for programming languages, as well as state-of-the-art NMT models.

Improved Automatic Summarization of Subroutines via Attention to File Context

TLDR
This paper presents an approach that models the file context of subroutines and uses an attention mechanism to find words and concepts to use in summaries, and shows in an experiment that this approach extends and improves several recent baselines.

Recommendations for Datasets for Source Code Summarization

TLDR
A dataset of over 2.1m pairs of Java methods and one-sentence method descriptions from over 28k Java projects, based on prior work, is released, and recommendations for dataset standards are made based on experimental results.

Learning to Generate Comments for API-Based Code Snippets

TLDR
This paper takes API sequences as the core semantic representations of method-level API-based code snippets, generates comments from API sequences with sequence-to-sequence neural models, and shows that this approach generates reasonable and effective comments for API-based code snippets.