NaturalProver: Grounded Mathematical Proof Generation with Language Models

Sean Welleck, Jiacheng Liu, Ximing Lu, Hannaneh Hajishirzi, Yejin Choi
Theorem proving in natural mathematical language – the mixture of symbolic and natural language used by humans – plays a central role in mathematical advances and education, and tests aspects of reasoning that are core to intelligence. Yet it has remained underexplored with modern generative models. We study large-scale language models on two new generation tasks: suggesting the next step in a mathematical proof, and full proof generation. We develop NaturalProver, a language model that…

A Survey of Deep Learning for Mathematical Reasoning

This survey reviews the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade, evaluates existing benchmarks and methods, and discusses future research directions.

ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning

ROSCOE is a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics; by leveraging properties of step-by-step rationales, they can measure semantic consistency, logicality, informativeness, fluency, and factuality, among other traits.

Towards a Mathematics Formalisation Assistant using Large Language Models

This work explores the ability of a large language model (Codex) to help with formalisation in the Lean theorem prover, finding that with careful input-dependent prompt selection and postprocessing, Codex can formalise short mathematical statements at undergraduate level with nearly 75% accuracy on 120 theorem statements.

Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs

This work introduces Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems.

Dynamic Generation of Interpretable Inference Rules in a Neuro-Symbolic Expert System

This novel reasoning engine, NELLIE, dynamically instantiates interpretable inference rules that capture and score entailment (de)compositions over natural language statements, and it achieves competitive performance on scientific QA datasets requiring structured explanations over multiple facts.

Linear algebra with transformers

It is shown that small transformers can be trained, from examples only, to compute approximate solutions with more than 90% accuracy (over 99% in most cases), which suggests that applications of transformers to mathematics are not limited to symbolic computation and can cover a broader range of scientific problems.



GLEU: Automatic Evaluation of Sentence-Level Fluency

An automatic evaluation metric to estimate fluency alone is developed, by examining the use of parser outputs as metrics, and it is shown that they correlate with human judgements of generated text fluency.

NaturalProofs: Mathematical Theorem Proving in Natural Language

NaturalProofs is a multi-domain corpus of mathematical statements and their proofs, written in natural mathematical language, that unifies broad coverage, deep coverage, and low-resource mathematical sources, allowing for evaluating both in-distribution and zero-shot generalization.

WebGPT: Browser-assisted question-answering with human feedback

GPT-3 is tuned to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web, and the best model is obtained by using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences.

Retrieval Augmentation Reduces Hallucination in Conversation

This work explores the use of neural-retrieval-in-the-loop architectures recently shown to be effective in open-domain QA for knowledge-grounded dialogue, a task that is arguably more challenging as it requires querying based on complex multi-turn dialogue context and generating conversationally coherent responses.

Evaluating Large Language Models Trained on Code

It is found that repeated sampling from the GPT language model is a surprisingly effective strategy for producing working solutions to difficult prompts; the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics, are also discussed.

Memorizing Transformers

It is demonstrated that an approximate kNN lookup into a non-differentiable memory of recent (key, value) pairs improves language modeling across various benchmarks and tasks, including generic webtext, math papers, books, code, as well as formal theorems (Isabelle).

COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics

This paper presents Energy-based Constrained Decoding with Langevin Dynamics (COLD), a decoding framework which describes constrained generation as specifying constraints through an energy function, then performing differentiable reasoning over the constraints through gradient-based sampling.

Formal Mathematics Statement Curriculum Learning

It is shown that, at the same compute budget, expert iteration, by which the authors mean proof search interleaved with learning, dramatically outperforms proof search alone and is capable of finding and solving a curriculum of increasingly difficult problems, without the need for associated ground-truth proofs.

Competition-level code generation with AlphaCode

AlphaCode is introduced, a system for code generation that achieved an average ranking in the top 54.3% in simulated evaluations on recent programming competitions on the Codeforces platform, marking the first time an artificial intelligence system has performed competitively in programming competitions.

Chain of Thought Prompting Elicits Reasoning in Large Language Models

Experiments on three large language models show that chain-of-thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.