Corpus ID: 237142385

Program Synthesis with Large Language Models

@article{Austin2021ProgramSW,
  title={Program Synthesis with Large Language Models},
  author={Jacob Austin and Augustus Odena and Maxwell Nye and Maarten Bosma and Henryk Michalewski and David Dohan and Ellen Jiang and Carrie Cai and Michael Terry and Quoc V. Le and Charles Sutton},
  journal={ArXiv},
  year={2021},
  volume={abs/2108.07732}
}
This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems…
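As a rough illustration of the evaluation setup described above, the sketch below formats an MBPP-style problem (a natural-language description plus assert-based tests) into a prompt, samples candidate programs, and counts a candidate as correct if all of its asserts pass when executed. The generate() call and the example task are hypothetical stand-ins, not the paper's actual prompt or benchmark data.

# Minimal sketch of an MBPP-style few-shot evaluation loop.
# `generate(prompt, n)` is a hypothetical wrapper around a large language model.

def build_prompt(description, tests):
    # MBPP-style prompting pairs a natural-language description with its test asserts.
    return description + "\n" + "\n".join(tests) + "\n"

def passes_tests(candidate_code, tests):
    # Run the sampled program and its asserts in a fresh namespace.
    env = {}
    try:
        exec(candidate_code, env)   # define the candidate function
        for t in tests:
            exec(t, env)            # each assert raises on failure
        return True
    except Exception:
        return False

# Illustrative task in the MBPP style (not taken from the benchmark):
description = "Write a function to return the sum of the squares of the first n natural numbers."
tests = ["assert sum_of_squares(3) == 14", "assert sum_of_squares(1) == 1"]

# candidates = generate(build_prompt(description, tests), n=16)   # hypothetical model call
candidates = ["def sum_of_squares(n):\n    return sum(i * i for i in range(1, n + 1))"]
print(any(passes_tests(c, tests) for c in candidates))   # True if any sample solves the task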
Citations

Programming Puzzles
TLDR: Positive correlations are found between puzzle-solving performance and coding experience, and between puzzle difficulty for humans and for AI solvers; further improvements on P3 could have a significant impact on many program synthesis areas.
Optimal Neural Program Synthesis from Multimodal Specifications
TLDR: Experimental results on a multimodal synthesis dataset show that the proposed optimal neural synthesis approach substantially outperforms prior state-of-the-art techniques in accuracy, finds model-optimal programs more frequently, and explores fewer states during search.
Finetuned Language Models Are Zero-Shot Learners
TLDR: It is shown that instruction tuning (fine-tuning language models on a collection of datasets described via instructions) substantially boosts zero-shot performance on unseen tasks; FLAN substantially improves on its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of the 25 datasets evaluated.
Sparks: Inspiration for Science Writing using Language Models
TLDR: The "sparks" generated by this system (sentences related to a scientific concept, intended to inspire writers) are found to be more coherent and diverse than those from a competitive language model baseline, and to approach a human-created gold standard.
Cascaded Fast and Slow Models for Efficient Semantic Code Search
TLDR: An efficient and accurate semantic code search framework with cascaded fast and slow models, in which a fast transformer encoder model is learned to optimize a scalable index for fast retrieval, followed by a slow classification-based re-ranking model that improves the top-K results returned by the fast retrieval.
Program Synthesis Guided Reinforcement Learning for Partially Observed Environments
TLDR: This work proposes a new approach, model predictive program synthesis (MPPS), which uses program synthesis to automatically generate the guiding programs for program-guided reinforcement learning, without requiring the user to provide a new guiding program for every new task.
Learning to Synthesize Programs as Interpretable and Generalizable Policies
TLDR: Experimental results demonstrate that the proposed framework not only learns to reliably synthesize task-solving programs but also outperforms DRL and program synthesis baselines while producing interpretable and more generalizable policies.
Measuring Mathematical Problem Solving With the MATH Dataset
TLDR: This work introduces MATH, a new dataset of 12,500 challenging competition mathematics problems which can be used to teach models to generate answer derivations and explanations, and shows that accuracy remains relatively low, even with enormous Transformer models.
An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions
TLDR: This work systematically investigates the prevalence and conditions that can cause GitHub Copilot to recommend insecure code, and explores Copilot's performance on three distinct code generation axes, examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains.
Unsolved Problems in ML Safety
TLDR: A new roadmap for ML Safety is provided and four problems ready for research are presented: withstanding hazards, identifying hazards, steering ML systems, and reducing hazards in deployment.

References

Showing 1-10 of 112 references
Evaluating Large Language Models Trained on Code
TLDR: It is found that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts; the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics, are also discussed.
BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration
TLDR: A new synthesis approach that leverages learning to guide a bottom-up search over programs, training a model to prioritize compositions of intermediate values during search, conditioned on a given set of input-output examples.
Measuring Coding Challenge Competence With APPS
TLDR: APPS, a benchmark for code generation, measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code; it is found that the prevalence of syntax errors is decreasing exponentially as models improve.
Execution-Guided Neural Program Synthesis
TLDR: This work proposes two simple yet principled techniques to better leverage semantic information, namely execution-guided synthesis and a synthesizer ensemble, both general enough to be combined with any existing encoder-decoder-style neural program synthesizer.
Neural Sketch Learning for Conditional Program Generation
TLDR: This work trains a neural generator not on code but on program sketches, models of program syntax that abstract out names and operations that do not generalize across programs, and shows that it can often predict the entire body of a method given just a few API calls or data types that appear in the method.
Programming Puzzles
TLDR: Positive correlations are found between puzzle-solving performance and coding experience, and between puzzle difficulty for humans and for AI solvers; further improvements on P3 could have a significant impact on many program synthesis areas.
Code completion with statistical language models
TLDR: The main idea is to reduce code completion to a natural-language processing problem of predicting probabilities of sentences, and to design a simple and scalable static analysis that extracts sequences of method calls from a large codebase and indexes them into a statistical language model.
PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers
TLDR: This work introduces PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style.
Automatic Program Synthesis of Long Programs with a Learned Garbage Collector
TLDR: Using this method, the problem of automatically generating code from sample input-output pairs is considered; programs more than twice as long as existing state-of-the-art solutions are created, while the success rate for comparable lengths improves and run-time is cut by two orders of magnitude.
A large-scale benchmark for few-shot program induction and synthesis
TLDR: This work proposes a new way of leveraging a collection of programs with associated unit tests to create a much larger collection of test-program pairs, by extracting subprograms of each program and using the inputs of the overall program to derive tests for each subprogram.