Corpus ID: 235755472

Evaluating Large Language Models Trained on Code

@article{Chen2021EvaluatingLL,
  title={Evaluating Large Language Models Trained on Code},
  author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde and Jared Kaplan and Harrison Edwards and Yura Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and Felipe Petroski Such and David W. Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William H. Guss and Alex Nichol and Igor Babuschkin and S. Arun Balaji and Shantanu Jain and Andrew Carr and Jan Leike and Joshua Achiam and Vedant Misra and Evan Morikawa and Alec Radford and Matthew M. Knight and Miles Brundage and Mira Murati and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.03374}
}
We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective…
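The solve rates quoted above are pass@k estimates computed from n samples per problem. A minimal sketch of the numerically stable, unbiased estimator described in the paper, in Python with NumPy (function and variable names are ours):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k.

    n: total samples generated for a problem
    c: number of those samples that pass the unit tests
    k: number of samples we are allowed to submit
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # 1 - probability that all k drawn samples are incorrect
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 30 of them correct, estimate pass@10
print(pass_at_k(200, 30, 10))
```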
A systematic evaluation of large language models of code
TLDR
This work finds that existing open-source models do achieve close results in some programming languages, even though they are targeted mainly at natural language modeling, and identifies an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code.
Measuring Coding Challenge Competence With APPS
TLDR
APPS is introduced, a benchmark for code generation that measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code and shows that machine learning models are now beginning to learn how to code.
CodeT: Code Generation with Generated Tests
TLDR
This paper explores the use of pre-trained language models to automatically generate test cases, calling the method CodeT (code generation with generated tests), and then chooses the best solution based on a dual execution agreement with both the generated test cases and the other generated solutions.
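The selection step can be pictured as a small consensus computation over generated solutions and generated tests. The sketch below is a simplified reading of the summary above, not the paper's exact scoring; `passes` stands in for a sandboxed execution harness that is not shown:

```python
from collections import defaultdict

def select_solution(candidates, tests, passes):
    """Choose a candidate by (simplified) dual execution agreement.

    candidates: candidate program strings
    tests: generated test-case strings
    passes(code, test) -> bool: hypothetical sandboxed runner (not shown)
    """
    # A candidate's signature is the exact subset of generated tests it passes;
    # candidates with the same signature form a consensus group.
    groups = defaultdict(list)
    for cand in candidates:
        sig = frozenset(t for t in tests if passes(cand, t))
        groups[sig].append(cand)

    # Score each group by (#agreeing candidates) x (#tests the group passes)
    # and return one representative of the best-scoring group.
    best_sig = max(groups, key=lambda sig: len(groups[sig]) * len(sig))
    return groups[best_sig][0]
```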
Pop Quiz! Can a Large Language Model Help With Reverse Engineering?
TLDR
An extensive quantitative analysis of the measured performance of the language model on a set of program purpose identification and information extraction tasks shows that LLMs are not yet ready for zero-shot reverse engineering.
Repository-Level Prompt Generation for Large Language Models of Code
TLDR
This work proposes a framework called Repo-Level Prompt Generator that learns to generate example-specific prompts using a set of rules that take context from the entire repository, thereby incorporating both the structure of the repository and the context from other relevant files.
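As a rough illustration of what "context from the entire repository" can mean, the sketch below hard-codes one fixed rule (prepend truncated sibling .py files to the completion prefix). The framework in the paper learns which rule to apply per example, so every name and choice here is our own simplification:

```python
from pathlib import Path

def build_repo_prompt(repo_root: str, target_file: str, prefix: str,
                      max_chars_per_file: int = 2000) -> str:
    """Illustrative single-rule prompt builder (not the learned framework)."""
    context = []
    for path in sorted(Path(repo_root).rglob("*.py")):
        if path.resolve() == Path(target_file).resolve():
            continue  # skip the file we are completing
        snippet = path.read_text(errors="ignore")[:max_chars_per_file]
        context.append(f"# File: {path.relative_to(repo_root)}\n{snippet}")
    # Prompt = repository context followed by the code preceding the hole.
    return "\n\n".join(context) + "\n\n" + prefix
```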
MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
TLDR
A multilingual dataset, MCoNaLa, is proposed to benchmark code generation from natural language commands extending beyond English, and a quantitative evaluation of performance on the MCoNaLa dataset is presented by testing with state-of-the-art code generation systems.
Natural Language to Code Translation with Execution
TLDR
This work introduces execution result-based minimum Bayes risk decoding (MBR-EXEC) for program selection and shows that it improves the few-shot performance of pretrained code models on natural-language-to-code tasks, suggesting it as an effective approach for natural language to code translation.
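With a 0/1 loss over execution results, minimum Bayes risk selection reduces to picking the candidate whose outputs agree with the most other candidates. The sketch below shows only that reduced form, with a hypothetical sandboxed `run` helper that is not implemented here:

```python
from collections import Counter

def mbr_exec_select(candidates, inputs, run):
    """Execution-based MBR selection under a 0/1 agreement loss (sketch).

    candidates: candidate program strings
    inputs: test inputs to execute each candidate on
    run(code, x): hypothetical sandboxed runner returning the program's output
    """
    # Summarise each candidate's behaviour by its outputs on the shared inputs.
    signatures = [tuple(run(code, x) for x in inputs) for code in candidates]
    # The minimum-risk candidate is the one whose signature is most common.
    counts = Counter(signatures)
    best = max(range(len(candidates)), key=lambda i: counts[signatures[i]])
    return candidates[best]
```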
Program Synthesis with Large Language Models
TLDR
The limits of the current generation of large language models for program synthesis in general purpose programming languages are explored, finding that even the best models are generally unable to predict the output of a program given a specific input.
Is GitHub's Copilot as Bad As Humans at Introducing Vulnerabilities in Code?
TLDR
This work investigates whether Copilot, as a representative of code-generation tools built with language models, is just as likely as human developers to introduce the same software vulnerabilities.
Learning to Superoptimize Real-World Programs
  • 2022
...
...

References

Showing 1-10 of 124 references
Measuring Coding Challenge Competence With APPS
TLDR
APPS is introduced, a benchmark for code generation that measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code and shows that machine learning models are now beginning to learn how to code.
In-IDE Code Generation from Natural Language: Promise and Challenges
TLDR
This article develops a plugin for the PyCharm IDE that implements a hybrid of code generation and code retrieval functionality, and asks developers with various backgrounds to complete 14 Python programming tasks of 7 varieties, ranging from basic file manipulation to machine learning or data visualization, with or without the help of the plugin.
PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers
TLDR
This work introduces PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style.
code2seq: Generating Sequences from Structured Representations of Code
TLDR
This model represents a code snippet as the set of compositional paths in its abstract syntax tree and uses attention to select the relevant paths while decoding and significantly outperforms previous models that were specifically designed for programming languages, as well as state-of-the-art NMT models.
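For intuition, a "path context" pairs two leaf tokens with the syntactic path that connects them through their lowest common ancestor. The toy extractor below uses Python's ast module purely for illustration; code2seq itself encodes such paths (over Java ASTs) with LSTMs and attention rather than printing them:

```python
import ast
from itertools import combinations

def leaf_paths(source: str):
    """Root-to-leaf paths (lists of AST nodes) for identifier/constant leaves."""
    leaves = []
    def walk(node, path):
        path = path + [node]
        if isinstance(node, (ast.Name, ast.Constant)):
            token = node.id if isinstance(node, ast.Name) else repr(node.value)
            leaves.append((token, path))
        for child in ast.iter_child_nodes(node):
            walk(child, path)
    walk(ast.parse(source), [])
    return leaves

def path_contexts(source: str):
    """Toy path contexts: (leaf token, up-over-down node-type path, leaf token)."""
    contexts = []
    for (tok_a, pa), (tok_b, pb) in combinations(leaf_paths(source), 2):
        i = 0  # length of the shared ancestor prefix (up to the LCA)
        while i < min(len(pa), len(pb)) and pa[i] is pb[i]:
            i += 1
        path = list(reversed(pa[i:])) + [pa[i - 1]] + pb[i:]
        contexts.append((tok_a, "^".join(type(n).__name__ for n in path), tok_b))
    return contexts

# e.g. [('a', 'Name^BinOp^Name', 'b')]
print(path_contexts("def add(a, b):\n    return a + b"))
```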
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
TLDR
This work develops CodeBERT with a Transformer-based neural architecture and trains it with a hybrid objective function that incorporates the pre-training task of replaced token detection, which is to detect plausible alternatives sampled from generators.
Contrastive Code Representation Learning
TLDR
ContraCode is proposed: a contrastive pre-training task that learns code functionality, not form, and improves summarization and TypeScript type inference accuracy by 2 to 13 percentage points over competitive baselines.
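The "functionality, not form" idea is usually realized with an InfoNCE-style contrastive objective: embeddings of two semantics-preserving transformations of the same program are pulled together while other programs in the batch act as negatives. The NumPy sketch below shows only that loss and is not the paper's training code:

```python
import numpy as np

def info_nce(anchors: np.ndarray, positives: np.ndarray, tau: float = 0.07) -> float:
    """InfoNCE loss over a batch of (program, transformed-program) embedding pairs.

    anchors, positives: arrays of shape (batch, dim); row i of `positives` is a
    semantics-preserving transformation of row i of `anchors`, and every other
    row in the batch serves as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau                               # cosine similarity / temperature
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # cross-entropy to the true pair
```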
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
TLDR
This work introduces a new automatic evaluation metric, dubbed CodeBLEU, which absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow, and can achieve a better correlation with programmer-assigned scores compared with BLEU and accuracy.
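Concretely, CodeBLEU is a weighted sum of four component scores. The sketch below shows only that combination; computing the individual components (BLEU, keyword-weighted n-gram match, AST match, data-flow match) is assumed to happen elsewhere, and equal weights are a common default rather than a fixed part of the metric:

```python
def code_bleu(bleu: float, weighted_ngram: float, ast_match: float, dataflow_match: float,
              alpha: float = 0.25, beta: float = 0.25,
              gamma: float = 0.25, delta: float = 0.25) -> float:
    """Combine the four CodeBLEU sub-scores (each already in [0, 1])."""
    return alpha * bleu + beta * weighted_ngram + gamma * ast_match + delta * dataflow_match

# e.g. code_bleu(0.42, 0.47, 0.63, 0.58) -> 0.525
```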
An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation
TLDR
An empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects finds that such a model is able to fix thousands of unique buggy methods in the wild.
Unsupervised Translation of Programming Languages
TLDR
A fully unsupervised neural transcompiler that relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages is proposed.
Generating Bug-Fixes Using Pretrained Transformers
TLDR
This work introduces DeepDebug: a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub repositories, and frames bug-patching as a sequence-to-sequence learning task consisting of two steps: denoising pretraining and supervised finetuning on the target translation task.
...
...