Lila: A Unified Benchmark for Mathematical Reasoning

  title={Lila: A Unified Benchmark for Mathematical Reasoning},
  author={Swaroop Mishra and Matthew Finlayson and Pan Lu and Leonard Tang and Sean Welleck and Chitta Baral and Tanmay Rajpurohit and Oyvind Tafjord and Ashish Sabharwal and Peter Clark and A. Kalyan},
Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose Līla, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language…

Logical Tasks for Measuring Extrapolation and Rule Comprehension

This work describes and characterizes logical tasks, discusses system requirements for their solution, and relates logical tasks to concepts such as extrapolation, explainability, and inductive bias.

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

This work designs language models that learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions, explores the upper bound of GPT-3, and shows that CoT helps language models learn from fewer data.

Generating Sequences by Learning to Self-Correct

Self-Correction is presented, an approach that decouples an imperfect base generator from a separate corrector that learns to iteratively correct imperfect generations and improves upon the base generator in three diverse generation tasks: mathematical program synthesis, lexically-constrained generation, and toxicity control.


Self-correction provides a flexible framework for improving the performance of off-the-shelf and fine-tuned language models on a wide range of tasks by decomposing generation into a base generator

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Under both few-shot and zero-shot settings, PoT shows an average performance gain over CoT of around 12% across all the evaluated datasets, and by combining PoT with self-consistency decoding, it achieves SoTA performance on all math problem datasets and near-SoTA performance on financial datasets.



Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks

This work argues for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering, and classifies these tasks into skill sets so that researchers can identify (and then rectify) the failings of their systems.

NaturalProofs: Mathematical Theorem Proving in Natural Language

NaturalProofs is developed, a multi-domain corpus of mathematical statements and their proofs, written in natural mathematical language, that unifies broad coverage, deep coverage, and low-resource mathematical sources, allowing for evaluating both in-distribution and zero-shot generalization.

Symbolic Brittleness in Sequence Models: on Systematic Generalization in Symbolic Mathematics

A methodology for evaluating generalization that takes advantage of the problem domain's structure and access to a verifier is developed, and the problem of symbolic mathematical integration is considered, as it requires generalizing systematically beyond the training set.

Training Verifiers to Solve Math Word Problems

It is demonstrated that verification significantly improves performance on GSM8K, and there is strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

A new benchmark styled after GLUE is presented, comprising a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.

INT: An Inequality Benchmark for Evaluating Generalization in Theorem Proving

This paper introduces INT, an INequality Theorem proving benchmark specifically designed to test agents' generalization ability, evaluates the same agents augmented with Monte Carlo Tree Search (MCTS) at test time, and shows that MCTS can help to prove new theorems.

Taskonomy: Disentangling Task Transfer Learning

This work proposes a fully computational approach for modeling the structure of the space of visual tasks by finding (first and higher-order) transfer learning dependencies across a dictionary of twenty-six 2D, 2.5D, 3D, and semantic tasks in a latent space, and provides a computational taxonomic map for task transfer learning.

Finetuned Language Models Are Zero-Shot Learners

It is shown that instruction tuning (finetuning language models on a collection of datasets described via instructions) substantially improves zero-shot performance on unseen tasks and outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze.

Learning from Explicit and Implicit Supervision Jointly For Algebra Word Problems

This paper demonstrates that it is possible to efficiently mine algebra problems and their numerical solutions with little to no manual effort and proposes a novel structured-output learning algorithm that aims to learn from both explicit and implicit supervision signals jointly.

DRAW: A Challenging and Diverse Algebra Word Problem Set

A quantitative comparison of DRAW to existing benchmarks is presented, showing that DRAW consists of a wide variety of problems, both in terms of narrative diversity and problem types, and a strong baseline for DRAW is provided using a simple yet powerful solver.