Corpus ID: 232134851

Measuring Mathematical Problem Solving With the MATH Dataset

@article{Hendrycks2021MeasuringMP,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Xiaodong Song and Jacob Steinhardt},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.03874}
}
Many intellectual endeavors require mathematical problem solving, but this skill remains beyond the capabilities of computers. To measure this ability in machine learning models, we introduce MATH, a new dataset of 12,500 challenging competition mathematics problems. Each problem in MATH has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations. To facilitate future research and increase accuracy on MATH, we also contribute a large…
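The per-problem format lends itself to programmatic inspection. The sketch below is a minimal example, assuming the commonly described per-problem JSON fields `problem`, `level`, `type`, and `solution`, and the convention that the final answer appears inside a LaTeX `\boxed{...}` group in the solution text; the record shown is hypothetical, not drawn from the dataset.

```python
def extract_boxed_answer(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a LaTeX solution,
    matching braces so nested groups like \\boxed{\\frac{1}{2}} survive."""
    start = solution.rfind("\\boxed{")
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in solution[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        chars.append(ch)
    return "".join(chars)

# Hypothetical record mirroring the assumed per-problem JSON layout.
record = {
    "problem": "What is $1 + 2 + \\cdots + 10$?",
    "level": "Level 1",
    "type": "Algebra",
    "solution": "The sum is $\\frac{10 \\cdot 11}{2} = \\boxed{55}$.",
}
print(extract_boxed_answer(record["solution"]))  # -> 55
```

Comparing an answer extracted this way against the reference answer extracted the same way yields an exact-match accuracy metric over the test split.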

Citations

Pretrained Language Models are Symbolic Mathematics Solvers too!
TLDR
A sample-efficient way of solving symbolic tasks is presented: the transformer model is first pretrained on language translation and then fine-tuned to solve the downstream task of symbolic mathematics.
Learning Methods for Solving Astronomy Course Problems
TLDR
This work trains a specialized machine learning model to solve undergraduate-level Introduction to Astronomy course problems using a Transformer trained on both text and code, namely OpenAI Codex, and introduces the concept of turning questions into programming tasks.
Towards Tractable Mathematical Reasoning: Challenges, Strategies, and Opportunities for Solving Math Word Problems
TLDR
This work inspects non-neural and neural methods for solving math word problems stated in natural language, and highlights the extent to which these methods are generalizable, mathematically sound, interpretable, and explainable.
NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks
TLDR
NumGLUE, a multi-task benchmark that evaluates the performance of AI systems on eight different tasks that at their core require simple arithmetic understanding, is proposed; the benchmark is shown to be far from solved, with neural models, including state-of-the-art large-scale language models, performing significantly worse than humans.
Teaching Autoregressive Language Models Complex Tasks By Demonstration
TLDR
The results suggest that fine-tuning autoregressive language models on small sets of well-crafted demonstrations may be a useful paradigm for enabling individuals without training in machine learning to coax such models to perform some kinds of complex multi-step tasks.
Continual Pre-training of Language Models for Math Problem Understanding with Syntax-Aware Memory Network
TLDR
COMUS, a new approach to continually pre-train language models for math problem understanding with a syntax-aware memory network, is proposed; it models the interaction between each token from the text and its semantically related nodes within the formulas, which helps capture fine-grained semantic correlations between texts and formulas.
Formal Mathematics Statement Curriculum Learning
TLDR
It is shown that, at the same compute budget, expert iteration (by which the authors mean proof search interleaved with learning) dramatically outperforms proof search alone, and is capable of finding and solving a curriculum of increasingly difficult problems without the need for associated ground-truth proofs.
Training Verifiers to Solve Math Word Problems
TLDR
It is demonstrated that verification significantly improves performance on GSM8K, and there is strong empirical evidence that verification scales more effectively with increased data than a fine-tuning baseline.
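Read operationally, the verification recipe amounts to best-of-n reranking at test time: sample several candidate solutions from the generator, score each with the trained verifier, and return the highest-scoring one. A minimal sketch of that selection step, where `generate` and `score` are hypothetical stand-ins for the fine-tuned generator and verifier models:

```python
from typing import Callable

def best_of_n(problem: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 100) -> str:
    """Sample n candidate solutions and return the one the
    verifier scores as most likely to be correct."""
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda sol: score(problem, sol))
```

The scaling claim is then that, as training data grows, investing it in the verifier pays off faster than investing it in further fine-tuning of the generator alone.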
NaturalProofs: Mathematical Theorem Proving in Natural Language
TLDR
NATURALPROOFS, a multi-domain corpus of mathematical statements and their proofs written in natural mathematical language, is developed; it unifies broad-coverage, deep-coverage, and low-resource mathematical sources, allowing evaluation of both in-distribution and zero-shot generalization.
How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI
TLDR
Several unsolved AI problems are crystallized into a single, new challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible.
...

References

SHOWING 1-10 OF 65 REFERENCES
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
TLDR
A large-scale dataset of math word problems is presented, together with an interpretable neural math problem solver that learns to map problems to their operation programs, and a new representation language for modeling the operation programs corresponding to each math problem, aiming to improve both the performance and the interpretability of the learned models.
Analysing Mathematical Reasoning Abilities of Neural Models
TLDR
This paper conducts a comprehensive analysis of models from two broad classes of the most powerful sequence-to-sequence architectures and finds notable differences in their ability to solve mathematical problems and generalize their knowledge.
Measuring Massive Multitask Language Understanding
TLDR
While most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average; however, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
Mathematical Reasoning via Self-supervised Skip-tree Training
TLDR
It is found that models trained on the skip-tree task show surprisingly strong mathematical reasoning abilities, and outperform models trained on standard skip-sequence tasks.
Deep Learning for Symbolic Mathematics
TLDR
It is shown that neural networks can be surprisingly good at more elaborate tasks in mathematics, such as symbolic integration and solving differential equations; a syntax for representing these mathematical problems is proposed, along with methods for generating large datasets that can be used to train sequence-to-sequence models.
Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems
TLDR
Experimental results show that indirect supervision of program learning via answer rationales is a promising strategy for inducing arithmetic programs.
LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning
TLDR
A new pre-training methodology called "LIME" (Learning Inductive bias for Mathematical rEasoning) is defined; models pre-trained with LIME significantly outperform vanilla transformers on four very different large mathematical reasoning benchmarks.
LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning
TLDR
A comprehensive dataset named LogiQA, sourced from expert-written questions for testing human logical reasoning, is built; state-of-the-art neural models are shown to perform far below the human ceiling.
Improving Graph Neural Network Representations of Logical Formulae with Subgraph Pooling
TLDR
This work proposes a novel approach for embedding logical formulae that is designed to overcome the representational limitations of prior approaches and achieves state-of-the-art performance on both premise selection and proof step classification.
How well do Computers Solve Math Word Problems? Large-Scale Dataset Construction and Evaluation
TLDR
A large-scale dataset more than 9 times the size of previous ones, containing many more problem types, is constructed; a model is trained to automatically extract problem answers from the answer text provided by CQA users, significantly reducing human annotation cost.
...