Corpus ID: 235755472

Evaluating Large Language Models Trained on Code

@article{chen2021evaluating,
  title={Evaluating Large Language Models Trained on Code},
  author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde and J. Kaplan and Harrison Edwards and Yura Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and F. Such and D. Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William H. Guss and Alex Nichol and I. Babuschkin and S. Balaji and Shantanu Jain and A. Carr and J. Leike and Joshua Achiam and Vedant Misra and Evan Morikawa and Alec Radford and M. Knight and Miles Brundage and Mira Murati and Katie Mayer and P. Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
  journal={arXiv preprint arXiv:2107.03374},
  year={2021}
}
We introduce Codex, a GPT language model finetuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective…
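HumanEval scores functional correctness by executing generated programs against unit tests and reporting pass@k. A minimal sketch of the unbiased pass@k estimator described in the Codex paper (variable names here are illustrative):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples is correct, given n generated samples of which c passed the
    unit tests. When fewer than k samples failed, every size-k subset
    must contain a correct sample, so the estimate is 1.0."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples per problem, 3 of which pass, pass@1 reduces to c/n:
score = pass_at_k(10, 3, 1)
```

In practice the per-problem estimates are averaged over the benchmark to obtain the reported pass@k figure.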
Program Synthesis with Large Language Models
This paper explores the limits of the current generation of large language models for program synthesis in general-purpose programming languages. We evaluate a collection of such models…
What do pre-trained code models know about code?
Pre-trained models of code built on the transformer architecture have performed well on software engineering (SE) tasks such as predictive code generation and code summarization, among others. However…
An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions
There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes…
Natural Language-guided Programming
The key idea is to adapt code autocompletion tools such that they take into account not only the developer’s already-written code but also the intent of the task the developer is trying to achieve next, formulated in plain natural language.
Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks
Project CodeNet is a first-of-its-kind, very large-scale, diverse, and high-quality dataset to accelerate algorithmic advancements in AI for Code; it consists of 14M code samples and about 500M lines of code in 55 different programming languages.
Learning to Synthesize Programs as Interpretable and Generalizable Policies
Recently, deep reinforcement learning (DRL) methods have achieved impressive performance on tasks in a variety of domains. However, neural network policies produced with DRL methods are not…
Multi-modal Program Inference: a Marriage of Pre-trained Language Models and Component-based Synthesis
Multi-modal program synthesis refers to the task of synthesizing programs (code) from specifications given in different forms, such as a combination of natural language and examples.
TruthfulQA: Measuring How Models Mimic Human Falsehoods
We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, and other domains.
Towards Universality in Multilingual Text Rewriting
In this work, we take the first steps towards building a universal rewriter: a model capable of rewriting text in any language to exhibit a wide variety of attributes, including styles and languages.
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
Zhengyuan Yang, Zhe Gan, +4 authors, Lijuan Wang · Computer Science · ArXiv · 2021
Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external…


Measuring Coding Challenge Competence With APPS
This paper introduces APPS, a benchmark for code generation that measures the ability of models to take an arbitrary natural language specification and generate Python code fulfilling that specification, and finds that machine learning models are beginning to learn how to code.
In-IDE Code Generation from Natural Language: Promise and Challenges
This paper develops a plugin for the IDE that implements a hybrid of code generation and code retrieval functionality, orchestrates virtual environments to enable collection of many user events, and identifies several pain points that, if addressed, could improve the effectiveness of future machine-learning-based code generation/retrieval developer assistants.
PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers
This work introduces PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style.
code2seq: Generating Sequences from Structured Representations of Code
This model represents a code snippet as the set of compositional paths in its abstract syntax tree and uses attention to select the relevant paths while decoding, and it significantly outperforms previous models that were specifically designed for programming languages, as well as state-of-the-art NMT models.
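As an illustration of path-based code representations, the toy sketch below collects root-to-leaf node-type paths from a Python abstract syntax tree. This is a simplified stand-in, not the actual code2seq scheme, which uses leaf-to-leaf paths with token endpoints over its own AST format:

```python
import ast

def ast_paths(source: str) -> list[list[str]]:
    """Collect node-type paths from the AST root down to each leaf node,
    a toy analogue of the compositional paths code2seq attends over."""
    paths: list[list[str]] = []

    def walk(node: ast.AST, prefix: list[str]) -> None:
        prefix = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:          # leaf node: record the full path
            paths.append(prefix)
        for child in children:
            walk(child, prefix)

    walk(ast.parse(source), [])
    return paths

# "x = 1" yields one path per leaf: the Store context of the target
# and the constant value being assigned.
paths = ast_paths("x = 1")
```

A real encoder would then embed each path and its endpoint tokens before attention-based decoding.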
An Empirical Study on Learning Bug-Fixing Patches in the Wild via Neural Machine Translation
An empirical study to assess the feasibility of using Neural Machine Translation techniques for learning bug-fixing patches for real defects finds that such a model is able to fix thousands of unique buggy methods in the wild.
Structured Generative Models of Natural Source Code
This paper presents a family of generative models for natural source code (NSC) with three key properties: first, they incorporate both sequential and hierarchical structure; second, they learn a distributed representation of source code elements; and third, they integrate closely with a compiler.
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
This work introduces a new automatic evaluation metric, dubbed CodeBLEU, which absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data-flow, and can achieve a better correlation with programmer-assigned scores compared with BLEU and accuracy.
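Schematically, CodeBLEU combines four component scores with a weighted sum. The sketch below assumes the components (n-gram match, keyword-weighted n-gram match, AST match, data-flow match) are already computed; the equal default weights are an assumption for illustration:

```python
def code_bleu(ngram: float, weighted_ngram: float,
              ast_match: float, dataflow_match: float,
              weights: tuple = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Combine the four CodeBLEU component scores (each in [0, 1]) into a
    single score via a weighted sum; equal weights shown for illustration."""
    a, b, g, d = weights
    return (a * ngram + b * weighted_ngram
            + g * ast_match + d * dataflow_match)

# Strong surface overlap but weak syntax/semantics match pulls the
# overall score down:
score = code_bleu(0.8, 0.8, 0.4, 0.4)
```

The syntax and semantics terms are what let CodeBLEU credit a correct program that differs lexically from the reference.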
A Parallel Corpus of Python Functions and Documentation Strings for Automated Code Documentation and Code Generation
A large and diverse parallel corpus of about a hundred thousand Python functions and their documentation strings (“docstrings”), generated by scraping open-source repositories on GitHub, is introduced.
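The kind of mining described above can be sketched with Python's standard `ast` module; the sample source and function names below are invented for the example:

```python
import ast
import textwrap

def extract_pairs(source: str) -> list[tuple[str, str]]:
    """Return (function_name, docstring) pairs from Python source text,
    skipping functions that have no docstring."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                pairs.append((node.name, doc))
    return pairs

src = textwrap.dedent('''
    def add(a, b):
        """Return the sum of a and b."""
        return a + b

    def undocumented(x):
        return x
''')
pairs = extract_pairs(src)
```

A real scraping pipeline would run this over whole repositories and add filtering (deduplication, docstring-style normalization) before pairing functions with their docstrings.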
Latent Predictor Networks for Code Generation
A novel neural network architecture is presented which generates an output sequence conditioned on an arbitrary number of input functions and allows both the choice of conditioning context and the granularity of generation, for example characters or tokens, to be marginalised, thus permitting scalable and effective training.
Unit Test Case Generation with Transformers and Focal Context
This paper proposes AthenaTest, an approach that aims to generate unit test cases by learning from real-world focal methods and developer-written test cases. It adopts a two-step training procedure: denoising pretraining on a large unsupervised Java corpus, followed by supervised finetuning on the downstream translation task of generating unit tests.