How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI

@article{Kalyan2021HowMC,
  title={How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI},
  author={A. Kalyan and Abhinav Kumar and Arjun Chandrasekaran and Ashish Sabharwal and Peter Clark},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.14207}
}
Many real-world problems require the combined application of multiple reasoning abilities—employing suitable abstractions, commonsense knowledge, and creative synthesis of problem-solving strategies. To help advance AI systems towards such capabilities, we propose a new reasoning challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible. For example, “How much would the sea… 

Inferring Implicit Relations with Language Models

TLDR
This work investigates why current models struggle with implicit reasoning question answering (QA) tasks, by decoupling inference of reasoning steps from their execution, and suggests that the bottleneck for answering implicit reasoning questions is in the ability of language models to retrieve and reason over information rather than to plan an accurate reasoning strategy.

Sub-Task Decomposition Enables Learning in Sequence to Sequence Tasks

TLDR
It is shown that when intermediate supervision is concatenated to the input and a sequence-to-sequence model is trained on this modified input, unlearnable composite problems can become learnable, and that this holds for any family of tasks that, on the one hand, are unlearnable and, on the other hand, can be decomposed into a polynomial number of simple sub-tasks.

Masked Measurement Prediction: Learning to Jointly Predict Quantities and Units from Textual Context

TLDR
A novel task, Masked Measurement Prediction (MMP), where a model learns to reconstruct a number together with its associated unit given masked text, is introduced; it is useful both for training new numerically informed models and for evaluating the numeracy of existing systems.

General-Purpose Question-Answering with Macaw

TLDR
The Macaw system is described, and a variety of question types where it produces surprisingly good answers are illustrated, well outside the training setup, offering insights into the limitations of pretrained language models.

References

GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

TLDR
This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks and provides formal granular evaluation metrics and identifies areas for future research.

Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

TLDR
This work introduces StrategyQA, a question answering benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy, and proposes a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts.

Measuring Mathematical Problem Solving With the MATH Dataset

TLDR
This work introduces MATH, a new dataset of 12,500 challenging competition mathematics problems which can be used to teach models to generate answer derivations and explanations, and shows that accuracy remains relatively low, even with enormous Transformer models.

An Empirical Investigation of Contextualized Number Prediction

TLDR
A suite of output distribution parameterizations is introduced that incorporate latent variables to add expressivity and better fit the natural distribution of numeric values in running text, and these are combined with both recurrent and transformer-based encoder architectures.

Do Language Embeddings capture Scales?

TLDR
This work identifies contextual information in pre-training and numeracy as two key factors affecting their performance, and shows that a simple method of canonicalizing numbers can have a significant effect on the results.

Deep Learning for Symbolic Mathematics

TLDR
It is shown that neural networks can be surprisingly good at more elaborate tasks in mathematics, such as symbolic integration and solving differential equations, and the work proposes a syntax for representing these mathematical problems and methods for generating large datasets that can be used to train sequence-to-sequence models.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

TLDR
Sentence-BERT (SBERT), a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity, is presented.

COMET: Commonsense Transformers for Automatic Knowledge Graph Construction

TLDR
This investigation reveals promising results when implicit knowledge from deep pre-trained language models is transferred to generate explicit knowledge in commonsense knowledge graphs, and suggests that using generative commonsense models for automatic commonsense KB completion could soon be a plausible alternative to extractive methods.

MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms

TLDR
This work introduces a large-scale dataset of math word problems, an interpretable neural math problem solver that learns to map problems to their operation programs, and a new representation language for modeling the operation programs corresponding to each math problem, with the aim of improving both the performance and the interpretability of the learned models.