Corpus ID: 237562834

CodeQA: A Question Answering Dataset for Source Code Comprehension

  • Chenxiao Liu, Xiaojun Wan
  • Published 17 September 2021
  • Computer Science
  • ArXiv
We propose CodeQA, a free-form question answering dataset for the purpose of source code comprehension: given a code snippet and a question, a textual answer is required to be generated. CodeQA contains a Java dataset with 119,778 question-answer pairs and a Python dataset with 70,085 question-answer pairs. To obtain natural and faithful questions and answers, we implement syntactic rules and semantic analysis to transform code comments into question-answer pairs. We present the construction…
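As a rough illustration of what turning a code comment into a question-answer pair could look like, the following is a minimal sketch with a single toy rule. The rule, the function name `comment_to_qa`, the question template, and the example record are all assumptions made for illustration; they are not taken from the CodeQA construction pipeline, which the abstract describes only as syntactic rules combined with semantic analysis.

```python
import re
from typing import Optional, Tuple

# Hypothetical rule (not from the CodeQA paper): a comment of the form
# "Return(s) X." is mapped to the pair
# ("What does the function return?", "X").
_RETURNS = re.compile(r"^\s*returns?\s+(?P<answer>.+?)\s*\.?\s*$", re.IGNORECASE)


def comment_to_qa(comment: str) -> Optional[Tuple[str, str]]:
    """Apply the toy rule; return None when the comment does not match it."""
    match = _RETURNS.match(comment)
    if match is None:
        return None
    return ("What does the function return?", match.group("answer"))


# Hypothetical record pairing a code snippet with its comment.
example = {
    "code": "def size(items):\n    return len(items)",
    "comment": "Returns the number of items.",
}

print(comment_to_qa(example["comment"]))
# ('What does the function return?', 'the number of items')
```

A comment that does not fit the template, such as "Sort the list in place.", yields `None` here; a realistic pipeline would need many more rules plus filtering to keep the generated pairs natural and faithful.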


SQuAD: 100,000+ Questions for Machine Comprehension of Text
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
Summarizing Source Code using a Neural Attention Model
This paper presents the first completely data-driven approach for generating high-level summaries of source code, which uses Long Short-Term Memory (LSTM) networks with attention to produce sentences that describe C# code snippets and SQL queries.
Deep Code Comment Generation
DeepCom applies Natural Language Processing (NLP) techniques to learn from a large code corpus and generates comments for Java methods from the learned features.
Summarizing Source Code with Transferred API Knowledge
Experiments on large-scale real-world industry Java projects indicate that the proposed approach, named TL-CodeSum, is effective and outperforms the state of the art in code summarization.
A Neural Question Answering System for Basic Questions about Subroutines
This paper designs a context-based QA system for basic questions about subroutines, based on rules the authors extract from recent empirical studies, trains a custom neural QA model with this dataset, and evaluates the model in a study with professional programmers.
NewsQA: A Machine Comprehension Dataset
NewsQA, a challenging machine comprehension dataset of over 100,000 human-generated question-answer pairs, is presented and analysis confirms that NewsQA demands abilities beyond simple word matching and recognizing textual entailment.
A Transformer-based Approach for Source Code Summarization
This work explores the Transformer model, which uses a self-attention mechanism and has been shown to be effective in capturing long-range dependencies, for source code summarization, and shows that although the approach is simple, it outperforms state-of-the-art techniques by a significant margin.
Syn-QG: Syntactic and Shallow Semantic Rules for Question Generation
Question Generation (QG) is fundamentally a simple syntactic transformation; however, many aspects of semantics influence what questions are good to form. We implement this observation by developing…
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
This new dataset aims to overcome a number of well-known weaknesses of previous publicly available datasets for the same task of reading comprehension and question answering, and is the most comprehensive real-world dataset of its kind in both quantity and quality.
A Neural Model for Generating Natural Language Summaries of Program Subroutines
This paper presents a neural model that combines words from code with code structure from an AST, which allows the model to learn code structure independent of the text in code.