Corpus ID: 232134851

Measuring Mathematical Problem Solving With the MATH Dataset

@article{Hendrycks2021MeasuringMP,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.03874}
}
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics.
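Concretely, each MATH problem pairs a statement with a worked solution whose final answer is wrapped in a \boxed{...} span, which makes exact match on the boxed answer a natural automatic metric. The Python sketch below illustrates such a grader; the record fields (problem, level, type, solution) mirror the dataset's released JSON layout, but the sample record's content is made up here, and the whitespace-trimmed string comparison is a simplification of the paper's full answer-matching pipeline.

def extract_boxed(solution):
    """Return the contents of the last \\boxed{...} span, handling nested braces."""
    start = solution.rfind("\\boxed{")
    if start == -1:
        return None
    depth, chars = 1, []
    i = start + len("\\boxed{")
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        chars.append(ch)
        i += 1
    return "".join(chars)

def is_correct(model_output, reference_solution):
    """Exact match of the final boxed answers, after trimming whitespace."""
    pred = extract_boxed(model_output)
    gold = extract_boxed(reference_solution)
    return pred is not None and gold is not None and pred.strip() == gold.strip()

# One MATH-style record; field names follow the released JSON files,
# but the content is a made-up illustration.
record = {
    "problem": "What is $1 + 2 \\cdot 3$?",
    "level": "Level 1",
    "type": "Algebra",
    "solution": "By order of operations, $1 + 2 \\cdot 3 = 1 + 6 = \\boxed{7}$.",
}
print(is_correct("... so the final answer is $\\boxed{7}$.", record["solution"]))  # True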

Citations

Teaching Autoregressive Language Models Complex Tasks By Demonstration
This paper demonstrates that by fine-tuning an autoregressive language model (GPT-Neo [1], [2]) on appropriately structured step-by-step demonstrations, it is possible to teach it to execute a…
Measuring Coding Challenge Competence With APPS
This work introduces APPS, a benchmark for code generation that measures the ability of models to take an arbitrary natural language specification and generate Python code fulfilling this specification, and finds that machine learning models are beginning to learn how to code.
NaturalProofs: Mathematical Theorem Proving in Natural Language
This work develops NaturalProofs, a large-scale dataset of mathematical statements and their proofs, written in natural mathematical language, and proposes a mathematical reference retrieval task that tests a system’s ability to determine the key results that appear in a proof.
Solving Machine Learning Problems
This work generates a new training set of questions and answers consisting of course exercises, homework, and quiz questions from MIT’s 6.036 Introduction to Machine Learning course and trains a machine learning model to solve machine learning problems from a university undergraduate-level course.
MiniF2F: a cross-system benchmark for formal Olympiad-level mathematics
  • Kunhao Zheng, Jesse Michael Han, Stanislas Polu
  • Computer Science
  • ArXiv
  • 2021
We present miniF2F, a dataset of formal Olympiad-level mathematics problem statements intended to provide a unified cross-system benchmark for neural theorem proving. The miniF2F benchmark currently…
TruthfulQA: Measuring How Models Mimic Human Falsehoods
We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics.
Reasoning with Transformer-based Models: Deep Learning, but Shallow Reasoning
  • 2021
Recent years have seen impressive performance of transformer-based models on different natural language processing tasks. However, it is not clear to what degree the transformers can reason on…
Effect of large-scale pre-training on full and few-shot transfer learning for natural and medical images
This work conducts large-scale pre-training on large source datasets of either natural (ImageNet-21k/1k) or medical chest X-ray images and compares full and few-shot transfer using different target datasets from both natural and medical imaging domains, indicating the possibility of obtaining high-quality models for domain-specific transfer by pre-training instead on comparably very large, generic source data.
CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review
It is found that Transformer models have nascent performance, but that this performance is strongly influenced by model design and training dataset size, so there is still substantial room for improvement.
Representing Numbers in NLP: a Survey and a Vision
This work synthesizes best practices for representing numbers in text and articulates a vision for holistic numeracy in NLP, comprising design trade-offs and a unified evaluation.

References

Showing 1-10 of 45 references
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
This work introduces a large-scale dataset of math word problems, an interpretable neural math problem solver that learns to map problems to their operation programs, and a new representation language to model the operation programs corresponding to each math problem, aiming to improve both the performance and the interpretability of the learned models.
Analysing Mathematical Reasoning Abilities of Neural Models
This paper conducts a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and finds notable differences in their ability to resolve mathematical problems and generalize their knowledge.
Deep Learning for Symbolic Mathematics
It is shown that neural networks can be surprisingly good at more elaborate tasks in mathematics, such as symbolic integration and solving differential equations; this work proposes a syntax for representing these mathematical problems and methods for generating large datasets that can be used to train sequence-to-sequence models.
Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems
Experimental results show that indirect supervision of program learning via answer rationales is a promising strategy for inducing arithmetic programs.
LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning
Inspired by Peirce’s view that deduction, induction, and abduction form an irreducible set of reasoning primitives, this work designs three synthetic tasks intended to require a model to have these three abilities.
LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning
A comprehensive dataset, named LogiQA, is built, sourced from expert-written questions for testing human logical reasoning; experiments show that state-of-the-art neural models perform far worse than the human ceiling.
Modelling High-Level Mathematical Reasoning in Mechanised Declarative Proofs
A non-synthetic dataset is built from the largest repository of mechanised proofs, along with a task on causal reasoning where a model is required to fill in a missing intermediate proposition given a causal context; a hierarchical transformer model is proposed that outperforms the transformer baseline.
GamePad: A Learning Environment for Theorem Proving
A system called GamePad is introduced that can be used to explore the application of machine learning methods to theorem proving in the Coq proof assistant and addresses position evaluation and tactic prediction tasks, which arise naturally in tactic-based theorem proving.
MathZero, The Classification Problem, and Set-Theoretic Type Theory
To the authors’ knowledge, the first isomorphism inference rules for set-theoretic dependent type theory with propositional set-theoretic equality are given, intended to be accessible to mathematicians with no prior exposure to type theory.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.