Corpus ID: 234790100

Measuring Coding Challenge Competence With APPS

@article{Hendrycks2021MeasuringCC,
  title={Measuring Coding Challenge Competence With APPS},
  author={Dan Hendrycks and Steven Basart and Saurav Kadavath and Mantas Mazeika and Akul Arora and Ethan Guo and Collin Burns and Samir Puranik and Horace He and Dawn Xiaodong Song and Jacob Steinhardt},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.09938}
}
While programming is one of the most broadly applicable skills in modern society, it is unclear how well state-of-the-art machine learning models can write code. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to assess code generation performance in an accurate and rigorous manner. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark…
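The abstract above describes execution-based evaluation of generated code. Below is a minimal sketch of how a candidate solution might be scored against a problem's input/output test cases, assuming solutions are standalone Python scripts that read from stdin; the function name, file layout, and timeout are illustrative assumptions, not the official APPS harness.

    import subprocess

    def run_test_cases(solution_path, test_cases, timeout=4.0):
        """Return the fraction of (stdin, expected_stdout) pairs the program passes."""
        passed = 0
        for stdin_text, expected in test_cases:
            try:
                result = subprocess.run(
                    ["python", solution_path],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout,
                )
            except subprocess.TimeoutExpired:
                continue  # a timeout counts as a failed test case
            if result.stdout.strip() == expected.strip():
                passed += 1
        return passed / len(test_cases) if test_cases else 0.0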

Citations

Evaluating Large Language Models Trained on Code
TLDR
It is found that repeated sampling from the GPT language model is a surprisingly effective strategy for producing working solutions to difficult prompts (the pass@k estimator behind this strategy is sketched after this list), and the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics, are discussed.
MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages
TLDR
A multilingual dataset, MCoNaLa, is proposed to benchmark code generation from natural language commands extending beyond English, and a quantitative evaluation of performance on the MCoNaLa dataset is presented by testing state-of-the-art code generation systems.
Improving automatically generated code from Codex via Automated Program Repair
TLDR
This study systematically examines whether automated program repair (APR) techniques can fix the incorrect solutions produced by language models in LeetCode contests, revealing that automatically generated code shares some common programming mistakes with human-crafted solutions, indicating that existing APR tools have the potential to fix auto-generated code.
Natural Language to Code Translation with Execution
TLDR
This work introduces execution-result-based minimum Bayes risk decoding (MBR-EXEC) for program selection and shows that it improves the few-shot performance of pretrained code models on natural-language-to-code tasks, suggesting it as an effective approach for natural-language-to-code translation.
Neural Program Generation Modulo Static Analysis
TLDR
The neurosymbolic method allows a deep generative model to symbolically compute, via calls to a static-analysis tool, long-distance semantic relationships in the code it has already generated, and to learn to generate programs conditioned on them.
Program Synthesis with Large Language Models
TLDR
The limits of the current generation of large language models for program synthesis in general purpose programming languages are explored, finding that even the best models are generally unable to predict the output of a program given a specific input.
Fault-Aware Neural Code Rankers
TLDR
Fault-aware neural code rankers are proposed that can predict the correctness of a sampled program without executing it and can significantly increase the pass@1 accuracy of various code generation models on APPS, HumanEval and MBPP datasets.
Less is More: Summary of Long Instructions is Better for Program Synthesis
TLDR
It is shown that superfluous information often present in problem descriptions, such as human characters, background stories, and names, does not help models understand a task, and a meta-dataset is created from the frequently used APPS dataset for the program synthesis task.
FixEval: Execution-based Evaluation of Program Fixes for Competitive Programming Problems
TLDR
This work introduces FIXEVAL, a benchmark comprising buggy code submissions to competitive programming problems and their respective fixes, and introduces a rich test suite to evaluate and assess the correctness of model-generated program fixes.
CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation
TLDR
This paper investigates how to leverage an unlabelled code corpus to train a model for library-oriented code generation and presents CERT with two steps: a sketcher generates the sketch, then a generator fills the details in the sketch.
...
...
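Several of the citing papers above report pass@k numbers for repeated sampling ("Evaluating Large Language Models Trained on Code", "Fault-Aware Neural Code Rankers"). Below is a minimal sketch of the standard unbiased pass@k estimator from the Codex paper, where n samples are drawn per problem and c of them pass every test case; the function name and typing are illustrative assumptions.

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k estimate for one problem:
        n = samples drawn, c = samples passing all tests, k = selection budget."""
        if n - c < k:
            return 1.0  # fewer than k failing samples, so any k drawn include a correct one
        return 1.0 - comb(n - c, k) / comb(n, k)

    # A benchmark-level score averages pass_at_k over all problems.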

References

SHOWING 1-10 OF 49 REFERENCES
Evaluating Large Language Models Trained on Code
TLDR
It is found that repeated sampling from the GPT language model is a surprisingly effective strategy for producing working solutions to difficult prompts, and the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics are discussed.
Deep Learning Based Program Generation From Requirements Text: Are We There Yet?
TLDR
A popularity-based approach is proposed that always generates the most popular statements in training programs regardless of the input (software requirements), and evaluation results suggest that none of the state-of-the-art approaches can outperform this simple statistics-based approach.
Mapping Language to Code in Programmatic Context
TLDR
This work introduces CONCODE, a new large dataset with over 100,000 examples consisting of Java classes from online code repositories, and develops a new encoder-decoder architecture that models the interaction between the method documentation and the class environment.
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis
TLDR
This work introduces a new automatic evaluation metric, dubbed CodeBLEU, which absorbs the strength of BLEU in the n-gram match and further injects code syntax via abstract syntax trees (AST) and code semantics via data flow, and it achieves a better correlation with programmer-assigned scores than BLEU and accuracy (its composite formula is sketched after this reference list).
Unsupervised Translation of Programming Languages
TLDR
A fully unsupervised neural transcompiler that relies exclusively on monolingual source code, requires no expertise in the source or target languages, and can easily be generalized to other programming languages is proposed.
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
TLDR
This paper introduces CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation that includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison.
Mining source code repositories at massive scale using language modeling
TLDR
This paper builds the first giga-token probabilistic language model of source code, based on 352 million lines of Java, and proposes new metrics that measure the complexity of a code module and the topical centrality of a module to a software project.
Measuring Massive Multitask Language Understanding
TLDR
While most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average; however, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
Latent Predictor Networks for Code Generation
TLDR
A novel neural network architecture is presented which generates an output sequence conditioned on an arbitrary number of input functions and allows both the choice of conditioning context and the granularity of generation, for example characters or tokens, to be marginalised, thus permitting scalable and effective training.
Generative Language Modeling for Automated Theorem Proving
TLDR
This work presents an automated prover and proof assistant, GPT-f, for the Metamath formalization language, and analyzes its performance, finding new short proofs that were accepted into the main Metamath library, which is, to the authors' knowledge, the first time a deep-learning-based system has contributed proofs that were adopted by a formal mathematics community.
...
...
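The CodeBLEU entry in the reference list above combines several signals into a single metric. As a sketch of its composite form (notation mine; the paper's default configuration weights the four components equally):

    \mathrm{CodeBLEU} = \alpha \cdot \mathrm{BLEU} + \beta \cdot \mathrm{BLEU}_{\mathrm{weight}} + \gamma \cdot \mathrm{Match}_{\mathrm{ast}} + \delta \cdot \mathrm{Match}_{\mathrm{df}}, \qquad \alpha = \beta = \gamma = \delta = 0.25,

where BLEU_weight is a keyword-weighted n-gram match, Match_ast the abstract-syntax-tree match, and Match_df the data-flow match.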