Semantic Similarity Metrics for Evaluating Source Code Summarization

Sakib Haque, Zachary Eberhart, Aakash Bansal and Collin McMillan. In 2022 IEEE/ACM 30th International Conference on Program Comprehension (ICPC).
Source code summarization involves creating brief natural-language descriptions of source code. These descriptions are a key component of software documentation such as JavaDocs. Automatic code summarization is a prized target of software engineering research, due to the high value summaries hold for programmers and the simultaneously high cost of writing and maintaining documentation by hand. Most current work is based on machine learning models trained on large datasets. Large datasets of…
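The motivation behind semantic similarity metrics can be illustrated with a minimal word-overlap sketch in Python. The function name and example summaries below are illustrative, not taken from the paper; the point is that overlap-based scores stay low even when two summaries mean the same thing:

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of reference words that also appear in the candidate:
    a crude word-overlap score in the spirit of BLEU/ROUGE."""
    cand = set(candidate.lower().split())
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return sum(1 for w in ref if w in cand) / len(ref)

# Two summaries with similar meaning but almost no shared words:
machine = "returns the element count of the list"
human = "gets the number of items stored"
print(unigram_overlap(machine, human))  # ≈ 0.33: only "the" and "of" overlap
```

A semantic similarity metric (e.g., cosine similarity over sentence embeddings) would rate this pair much higher, which is the gap the paper's evaluation targets.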



BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
ROUGE: A Package for Automatic Evaluation of Summaries
Four different ROUGE measures are introduced: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package, along with their evaluations.
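Of the four measures, ROUGE-N is the simplest: the fraction of reference n-grams that also appear in the candidate (a recall-oriented score). A minimal single-reference sketch, assuming whitespace tokenization and exact matching:

```python
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 2) -> float:
    """ROUGE-N recall: clipped n-gram overlap divided by the total
    number of n-grams in the reference (single-reference sketch)."""
    def ngrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's count by its count in the candidate.
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / total

print(rouge_n("the cat sat on the mat", "the cat is on the mat", n=2))  # → 0.6
```

The full package adds ROUGE-L (longest common subsequence), ROUGE-W (weighted LCS), and ROUGE-S (skip-bigrams), which this sketch does not cover.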
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments
METEOR is described, an automatic metric for machine translation evaluation that is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations, and can be easily extended to include more advanced matching strategies.
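The core of METEOR's unigram matching is a precision/recall combination that weights recall nine times more heavily than precision. A simplified sketch, assuming exact-match unigrams only and omitting METEOR's stemming, synonym matching, and fragmentation penalty:

```python
from collections import Counter

def meteor_fmean(candidate: str, reference: str) -> float:
    """METEOR's F-mean over exact unigram matches:
    F = 10 * P * R / (R + 9 * P), weighting recall 9x over precision.
    The fragmentation penalty and stem/synonym stages are omitted."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Exact unigram matches; each reference word is usable only once.
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    p = matches / len(cand)
    r = matches / len(ref)
    return (10 * p * r) / (r + 9 * p)

print(round(meteor_fmean("the cat sat", "the cat sat down"), 3))  # → 0.769
```

The recall-heavy weighting is what gives METEOR its improved correlation with human judgments relative to precision-oriented metrics such as BLEU.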
Statistics Without Maths For Psychology
Chapter overview: 1. Variables and research design; 2. Introduction to SPSS; 3. Descriptive statistics; 4. Probability, sampling and distributions; 5. Hypothesis testing and statistical significance; 6. Correlational analysis; …
Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
Qualitatively, the proposed RNN Encoder–Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.
A Human Study of Comprehension and Code Summarization
A human study involving both university students and professional developers found that participants performed significantly better using human-written summaries versus machine-generated summaries, but found no evidence that participants perceive human- and machine-generated summaries to have different qualities.
Reassessing automatic evaluation metrics for code summarization tasks
An empirical study with 226 human annotators indicates that metric improvements of less than 2 points do not guarantee systematic improvements in summarization quality, and are unreliable as proxies of human evaluation.
Action Word Prediction for Neural Source Code Summarization
This paper advocates for a special emphasis on action word prediction as an important stepping-stone problem toward better code summarization; it shows the value of the problem for code summaries, explores the performance of current baselines, and provides recommendations for future research.
A Wizard of Oz Study Simulating API Usage Dialogues With a Virtual Assistant
A set of Wizard of Oz experiments is presented to build a dataset for creating a hypothetical virtual assistant for helping programmers use APIs; a diverse range of interactions is observed that will facilitate the development of dialogue strategies for virtual assistants for API usage.
Code to Comment “Translation”: Data, Metrics, Baselining & Evaluation
It is argued that fairly naive information retrieval methods do well enough at this task to be considered a reasonable baseline, and suggestions are made on how the findings might be used in future research in this area.