Corpus ID: 237142385

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc V. Le, Charles Sutton
This paper explores the limits of the current generation of large language models for program synthesis in general-purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems…
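The few-shot setup described above (a natural-language description paired with assert-based test cases, with a handful of solved examples prepended) can be sketched roughly as follows. The problems and the `build_prompt` helper are hypothetical illustrations, not items from the actual benchmark:

```python
# Sketch of an MBPP-style few-shot prompt. Each problem is a short
# natural-language description plus assert statements the candidate
# program must satisfy; solved shots precede the unsolved target.

def build_prompt(shots, target):
    """Concatenate solved examples, then the target task for the model to complete."""
    parts = []
    for desc, tests, solution in shots:
        parts.append(f"# {desc}\n" + "\n".join(tests) + f"\n{solution}\n")
    desc, tests = target
    # The model is expected to generate the solution after this point.
    parts.append(f"# {desc}\n" + "\n".join(tests) + "\n")
    return "\n".join(parts)

shots = [
    ("Write a function to add two numbers.",
     ["assert add(1, 2) == 3"],
     "def add(a, b):\n    return a + b"),
]
target = ("Write a function to double a number.", ["assert double(4) == 8"])
prompt = build_prompt(shots, target)
```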
Optimal Neural Program Synthesis from Multimodal Specifications
The experimental results on a multimodal synthesis dataset show that the proposed optimal neural synthesis approach substantially outperforms prior state-of-the-art techniques in terms of accuracy, finds model-optimal programs more frequently, and explores fewer states during search.
Finetuned Language Models Are Zero-Shot Learners
It is shown that instruction tuning, i.e. finetuning language models on a collection of datasets described via instructions, substantially boosts zero-shot performance on unseen tasks, and that FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 evaluated datasets.
Towards A Measure Of General Machine Intelligence
A common language of instruction is proposed, i.e. a programming language that allows the expression of programs in the form of directed acyclic graphs across a wide variety of real-world domains and computing platforms, and a match-based method is demonstrated to both score performance and calculate the generalization difficulty of any given set of tasks.
Learning to Synthesize Programs as Interpretable and Generalizable Policies
Experimental results demonstrate that the proposed framework not only learns to reliably synthesize task-solving programs but also outperforms DRL and program synthesis baselines while producing interpretable and more generalizable policies.
An Empirical Cybersecurity Evaluation of GitHub Copilot's Code Contributions
This work systematically investigates the prevalence and conditions that can cause GitHub Copilot to recommend insecure code, and explores Copilot’s performance on three distinct code generation axes—examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains.
Unsolved Problems in ML Safety
A new roadmap for ML Safety is provided and four problems ready for research are presented, namely withstanding hazards, identifying hazards, steering ML systems, and reducing risks in how ML systems are handled.
GenLine and GenForm: Two Tools for Interacting with Generative Language Models in a Code Editor
A large generative language model’s output can be influenced through well-designed prompts, or text-based inputs that establish textual patterns that the model replicates in its output [6]. These…


Evaluating Large Language Models Trained on Code
It is found that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts, and the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics are discussed.
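Repeated sampling of this kind is commonly scored with the unbiased pass@k estimator described in this line of work: draw n samples, count the c that pass the tests, and estimate the probability that at least one of k samples would have passed. A minimal sketch (the function name is my own):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n candidates, of which c
    are correct, passes. Computes 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # necessarily contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 10 samples of which c = 5 pass, pass@1 is 0.5, matching the intuitive per-sample success rate; larger k only increases the estimate.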
BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration
A new synthesis approach leverages learning to guide a bottom-up search over programs, training a model to prioritize compositions of intermediate values during search, conditioned on a given set of input-output examples.
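The bottom-up search that BUSTLE's learned guidance accelerates can be illustrated with a plain enumerative baseline: build expressions by size, evaluate each on the example inputs, and prune observationally equivalent candidates. This sketch omits the paper's learned prioritization and uses a toy integer expression language; the operator set and constants are illustrative assumptions:

```python
def bottom_up_search(inputs, outputs, max_size=3):
    """Enumerate integer expressions bottom-up, pruning candidates that
    behave identically on the example inputs (observational equivalence).
    Returns an expression string over variable x, or None."""
    target = tuple(outputs)
    # size -> list of (expression, values on each example input)
    by_size = {1: [("x", tuple(inputs))] + [(str(c), (c,) * len(inputs)) for c in (1, 2)]}
    for expr, vals in by_size[1]:
        if vals == target:
            return expr
    seen = {vals for _, vals in by_size[1]}
    for size in range(2, max_size + 1):
        by_size[size] = []
        for ls in range(1, size):  # split size between left and right subexpressions
            for le, lv in by_size[ls]:
                for re, rv in by_size[size - ls]:
                    for op, fn in (("+", lambda a, b: a + b), ("*", lambda a, b: a * b)):
                        vals = tuple(fn(a, b) for a, b in zip(lv, rv))
                        if vals in seen:
                            continue  # observationally equivalent to a smaller expression
                        seen.add(vals)
                        expr = f"({le} {op} {re})"
                        if vals == target:
                            return expr
                        by_size[size].append((expr, vals))
    return None
```

Equivalence pruning is what keeps the search space manageable; BUSTLE's contribution is learning which intermediate values to compose first, rather than expanding all of them uniformly.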
Execution-Guided Neural Program Synthesis
This work proposes two simple yet principled techniques to better leverage the semantic information, which are execution-guided synthesis and synthesizer ensemble that are general enough to be combined with any existing encoder-decoder-style neural program synthesizer.
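One way to picture execution-guided search is a beam search over straight-line programs where candidates are ranked by their executed intermediate values rather than by syntax alone. The sketch below substitutes a toy numeric distance for the paper's learned guidance; the operator set and scoring are illustrative assumptions:

```python
def execution_guided(inp, target, ops, beam=3, steps=4):
    """Beam search over straight-line programs. Each candidate is
    (program, executed value); after each step only the `beam` states
    whose value lies closest to the target survive."""
    states = [([], inp)]
    for _ in range(steps):
        expanded = []
        for prog, val in states:
            for name, fn in ops:
                nv = fn(val)  # execute the partial program one step further
                if nv == target:
                    return prog + [name]
                expanded.append((prog + [name], nv))
        # Rank partial programs by their executed value, not their syntax.
        expanded.sort(key=lambda s: abs(s[1] - target))
        states = expanded[:beam]
    return None
```

For example, with `ops = [("inc", lambda v: v + 1), ("double", lambda v: v * 2)]`, searching from input 1 toward target 9 keeps the doubling branch alive because its intermediate values approach 9 fastest.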
Neural Sketch Learning for Conditional Program Generation
This work trains a neural generator not on code but on program sketches, or models of program syntax that abstract out names and operations that do not generalize across programs, and shows that it can often predict the entire body of a method given just a few API calls or data types that appear in the method.
Measuring Coding Challenge Competence With APPS
APPS is introduced, a benchmark for code generation that measures the ability of models to take an arbitrary natural language specification and generate Python code fulfilling this specification; the results suggest that machine learning models are beginning to learn how to code.
Code completion with statistical language models
The main idea is to reduce the problem of code completion to a natural-language processing problem of predicting probabilities of sentences, and to design a simple and scalable static analysis that extracts sequences of method calls from a large codebase and indexes these into a statistical language model.
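In its simplest form, this call-sequence idea reduces to an n-gram model over extracted method-call sequences. A bigram sketch, with the static-analysis extraction step stubbed out by hand-written sequences (an assumption of this example):

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count call bigrams; sequences stand in for those a static
    analysis would extract from a large codebase."""
    model = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            model[prev][nxt] += 1
    return model

def complete(model, prev_call):
    """Suggest the most probable next method call after prev_call."""
    candidates = model.get(prev_call)
    return candidates.most_common(1)[0][0] if candidates else None

calls = [
    ["open", "read", "close"],
    ["open", "read", "close"],
    ["open", "write", "close"],
]
model = train_bigram(calls)
```

Here `complete(model, "open")` suggests `"read"`, since "read" follows "open" twice in the training sequences versus once for "write".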
PyMT5: Multi-mode Translation of Natural Language and Python Code with Transformers
This work introduces PyMT5, the Python method text-to-text transfer transformer, which is trained to translate between all pairs of Python method feature combinations: a single model that can both predict whole methods from natural language documentation strings (docstrings) and summarize code into docstrings of any common style.
Automatic Program Synthesis of Long Programs with a Learned Garbage Collector
Using this method, the problem of automatically generating code given sample input-output pairs is considered; programs are created that are more than twice as long as existing state-of-the-art solutions, while improving the success rate for comparable lengths and cutting the run-time by two orders of magnitude.
A large-scale benchmark for few-shot program induction and synthesis
This work proposes a new way of leveraging a collection of programs with associated unit tests to create a much larger collection of test-program pairs by extracting subprograms of each program and using the inputs of the overall program to get tests for each subprogram.
Write, Execute, Assess: Program Synthesis with a REPL
We present a neural program synthesis approach integrating components which write, execute, and assess code to navigate the search space of possible programs. We equip the search process with an…