• Corpus ID: 231855531

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu
Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers… 
Unified Pre-training for Program Understanding and Generation
Analysis reveals that PLBART learns program syntax, style, and logical flow that are crucial to program semantics and thus excels even with limited annotations, and outperforms or rivals state-of-the-art models.
Text2App: A Framework for Creating Android Apps from Text Descriptions
It is demonstrated that Text2App generalizes well to unseen combinations of app components and is capable of handling noisy natural language instructions. The paper also explores the possibility of creating applications from highly abstract instructions by coupling the system with GPT-3, a large pretrained language model.
Literature review on vulnerability detection using NLP technology
A brief survey of recent papers and technologies in the field of vulnerability detection, such as CodeBERT, together with a summary of earlier techniques.
Few-shot training LLMs for project-specific code-summarization
This paper investigates the use of few-shot training with the very large GPT (Generative Pre-trained Transformer) Codex model, and finds evidence suggesting that one can significantly surpass state-of-the-art models for code summarization by leveraging project-specific training.
Fix Bugs with Transformer through a Neural-Symbolic Edit Grammar
This work introduces NSEdit (neural-symbolic edit), a novel Transformer-based code repair method that predicts an editing sequence that can fix the bugs in source code given only the source code that contains bugs.
ReGVD: Revisiting Graph Neural Networks for Vulnerability Detection
This work considers vulnerability detection as an inductive text classification problem and proposes ReGVD, a simple yet effective graph neural network-based model for the problem, which outperforms the existing state-of-the-art models and obtains the highest accuracy on the real-world benchmark dataset from CodeXGLUE for vulnerability detection.
CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning
This work proposes “CodeRL”, a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). It treats the code-generating LM as an actor network and introduces a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor.
An extensive study on pre-trained models for program understanding and generation
This is the first study of the robustness of natural language-programming language pre-trained models under adversarial attacks; it finds that a simple random attack approach can easily fool the state-of-the-art pre-trained models and thus incur security issues.
XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence
This paper introduces XLCoST, Cross-Lingual Code SnippeT dataset, a new benchmark dataset for cross-lingual code intelligence that contains fine-grained parallel data from 8 languages, and is the largest parallel dataset for source code both in terms of size and the number of languages.
NatGen: Generative pre-training by "Naturalizing" source code
This paper proposes a new pre-training objective, “Naturalizing” of source code, exploiting code’s bimodal, dual-channel (formal & natural channels) nature, and introduces six classes of semantic preserving transformations to introduce un-natural forms of code, and forces the model to produce more natural original programs written by developers.


Pythia: AI-assisted Code Completion System
The architecture of the Pythia system is described, comparisons to a frequency-based approach and an invocation-based Markov Chain language model are performed, and challenges of serving Pythia models on lightweight client devices are discussed.
Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow
A novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks.
Code Generation as a Dual Task of Code Summarization
This paper proposes a dual training framework to train the two tasks simultaneously, considers the dualities on probability and attention weights, and designs corresponding regularization terms to constrain the duality.
Incorporating External Knowledge through Pre-training for Natural Language to Code Generation
Evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
Learning to Represent Programs with Graphs
This work proposes to use graphs to represent both the syntactic and semantic structure of code and use graph-based deep learning methods to learn to reason over program structures, and suggests that these models learn to infer meaningful names and to solve the VarMisuse task in many cases.
Are deep neural networks the best choice for modeling source code?
This work enhances established language modeling approaches to handle the special challenges of modeling source code, such as frequent changes, larger, changing vocabularies, and deeply nested scopes, and presents a fast, nested language modeling toolkit specifically designed for software.
Summarizing Source Code using a Neural Attention Model
This paper presents the first completely data-driven approach for generating high-level summaries of source code, which uses Long Short-Term Memory (LSTM) networks with attention to produce sentences that describe C# code snippets and SQL queries.
A Survey of Machine Learning for Big Code and Naturalness
This article presents a taxonomy based on the underlying design principles of each model and uses it to navigate the literature and discuss cross-cutting and application-specific challenges and opportunities.
Improving Automatic Source Code Summarization via Deep Reinforcement Learning
  • Yao Wan, Zhou Zhao, Philip S. Yu
  • Computer Science
    2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE)
  • 2018
This work incorporates an abstract syntax tree structure as well as the sequential content of code snippets into a deep reinforcement learning framework (i.e., an actor-critic network), which provides the confidence of predicting the next word according to the current state, and uses an advantage reward composed of the BLEU metric to train both networks.
Neural Code Comprehension: A Learnable Representation of Code Semantics
A novel processing technique for learning code semantics is proposed and applied to a variety of program analysis tasks, showing that even without fine-tuning, a single RNN architecture and fixed inst2vec embeddings outperform specialized approaches for performance prediction and algorithm classification from raw code.