LILA: A Unified Benchmark for Mathematical Reasoning

Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and A. Kalyan. Conference on Empirical Methods in Natural Language Processing.

Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blank; (iii) language diversity, e.g., no language, simple language; (iv) …

Logical Tasks for Measuring Extrapolation and Rule Comprehension

This work describes and characterizes logical tasks, discusses system requirements for their solution, and discusses the relevance of logical tasks to concepts such as extrapolation, explainability, and inductive bias.

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Under both few-shot and zero-shot settings, PoT shows an average performance gain over CoT of around 12% across all evaluated datasets; combined with self-consistency decoding, it achieves SoTA performance on all math problem datasets and near-SoTA performance on financial datasets.
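As a rough illustration of the PoT idea, a minimal sketch: the model emits an executable program rather than free-text reasoning, and a separate interpreter performs the computation. The `fake_model` function below is a hypothetical stand-in for a real LM call, and the convention of storing the result in `ans` is an assumption for this sketch.

```python
# Minimal Program-of-Thoughts (PoT) style sketch: the "reasoning" step emits
# Python source, and the "computation" step delegates to the interpreter.

def fake_model(question: str) -> str:
    # A real system would prompt an LM; here one example program is hard-coded.
    return "price = 3 * 4\ndiscount = price * 0.25\nans = price - discount"

def pot_answer(question: str):
    program = fake_model(question)
    env: dict = {}
    exec(program, {}, env)   # computation is delegated to the interpreter
    return env["ans"]        # sketch convention: answer is stored in `ans`

print(pot_answer("3 items at $4 each with a 25% discount?"))  # 9.0
```

The separation matters because language models are unreliable at arithmetic, while the interpreter is exact.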

A Survey of Deep Learning for Mathematical Reasoning

This survey paper reviews the key tasks, datasets, and methods at the intersection of mathematical reasoning and deep learning over the past decade, evaluates existing benchmarks and methods, and discusses future research directions.

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

This work designs language models that learn to generate lectures and explanations as a chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions, explores the upper bound of GPT-3, and shows that CoT helps language models learn from less data.

Mathematics, word problems, common sense, and artificial intelligence

It is argued that it is not clear whether these kinds of limitations will be important in developing AI technology for pure mathematical research, but that they will be important in applications of mathematics, and may well be important in developing programs capable of reading and understanding mathematical content written by humans.

Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model

The strengths and weaknesses of different retriever-augmented language models, such as REALM, kNN-LM, FiD, ATLAS, and Flan-T5, in reasoning over the selected documents in different tasks are studied.

Generating Sequences by Learning to Self-Correct

Self-Correction is presented, an approach that decouples an imperfect base generator from a separate corrector that learns to iteratively correct imperfect generations; it improves upon the base generator in three diverse generation tasks: mathematical program synthesis, lexically-constrained generation, and toxicity control.
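The decoupled generate-then-correct loop can be sketched minimally as follows. All three components here (`base_generator`, `corrector`, `is_correct`) are toy stand-ins, not the paper's learned models; the point is the control flow of a separate corrector revising a fixed base generator's output.

```python
# Minimal sketch of a self-correction loop: an imperfect base generator
# produces a draft, and a separate corrector revises it until a check passes.

def base_generator(task: str) -> str:
    return "1 + 1 = 3"                 # deliberately wrong first draft

def corrector(draft: str) -> str:
    # A learned corrector would edit the draft; this toy one fixes the sum.
    return draft.replace("= 3", "= 2")

def is_correct(draft: str) -> bool:
    lhs, rhs = draft.split("=")
    return eval(lhs) == int(rhs)

def self_correct(task: str, max_rounds: int = 3) -> str:
    draft = base_generator(task)
    for _ in range(max_rounds):
        if is_correct(draft):
            break
        draft = corrector(draft)
    return draft

print(self_correct("add"))  # 1 + 1 = 2
```

Because the corrector is a separate module, it can be trained or swapped independently of the (possibly frozen, off-the-shelf) base generator.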

UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression

A large-scale unified geometry problem benchmark, UniGeo, is constructed, and a unified multi-task geometric transformer framework, Geoformer, is presented to tackle calculation and proving problems simultaneously in the form of sequence generation, showing that reasoning ability on both tasks can be improved by the unified formulation.

Reasoning with Language Model Prompting: A Survey

This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting and introduces research works with comparisons and summaries and provides systematic resources to help beginners.


Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks

This work argues for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering, and classifies these tasks into skill sets so that researchers can identify (and then rectify) the failings of their systems.

Symbolic Brittleness in Sequence Models: on Systematic Generalization in Symbolic Mathematics

A methodology for evaluating generalization that takes advantage of the problem domain's structure and access to a verifier is developed, and the problem of symbolic mathematical integration is considered, as it requires generalizing systematically beyond the training set.

Training Verifiers to Solve Math Word Problems

It is demonstrated that verification significantly improves performance on GSM8K, and there is strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
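The verification recipe can be sketched as best-of-N reranking: sample several candidate solutions, score each with a verifier, and return the top-scoring one. Both `sample_candidates` and `verifier_score` below are hypothetical toy stand-ins, not the paper's actual generator or trained verifier.

```python
# Hedged sketch of verifier-based reranking: generate N candidates and let a
# scoring function pick the best, instead of trusting a single greedy sample.

def sample_candidates(question: str, n: int) -> list[str]:
    # A real generator would sample n solutions from a language model.
    pool = ["answer: 40", "answer: 41", "answer: 42"]
    return [pool[i % len(pool)] for i in range(n)]

def verifier_score(question: str, solution: str) -> float:
    # A real verifier is a model trained to predict correctness; faked here.
    return 1.0 if solution.endswith("42") else 0.1

def best_of_n(question: str, n: int = 6) -> str:
    candidates = sample_candidates(question, n)
    return max(candidates, key=lambda s: verifier_score(question, s))

print(best_of_n("What is 6 * 7?"))  # answer: 42
```

The scaling claim in the entry above corresponds to growing N and the verifier's training data rather than finetuning the generator further.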

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

A new benchmark styled after GLUE is presented, comprising a set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.

INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving

This paper introduces INT, an INequality Theorem proving benchmark, specifically designed to test agents' generalization ability, and evaluates the same agents augmented with Monte Carlo Tree Search at test time, and shows that MCTS can help to prove new theorems.

Taskonomy: Disentangling Task Transfer Learning

This work proposes a fully computational approach for modeling the structure of the space of visual tasks by finding (first- and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space, and provides a computational taxonomic map for task transfer learning.

Finetuned Language Models Are Zero-Shot Learners

It is shown that instruction tuning, i.e., finetuning language models on a collection of datasets described via instructions, substantially improves zero-shot performance on unseen tasks and outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.

Learning from Explicit and Implicit Supervision Jointly For Algebra Word Problems

This paper demonstrates that it is possible to efficiently mine algebra problems and their numerical solutions with little to no manual effort and proposes a novel structured-output learning algorithm that aims to learn from both explicit and implicit supervision signals jointly.

DRAW: A Challenging and Diverse Algebra Word Problem Set

A quantitative comparison of DRAW to existing benchmarks is presented, showing that DRAW contains a wide variety of problems in terms of both narrative diversity and problem types, and a strong baseline for DRAW is provided using a simple yet powerful solver.

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.