From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project

@article{Clark2020FromT,
  title={From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project},
  author={Peter Clark and Oren Etzioni and Daniel Khashabi and Tushar Khot and Bhavana Dalvi and Kyle Richardson and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord and Niket Tandon and Sumithra Bhakthavatsalam and Dirk Groeneveld and Michal Guerquin and Michael Schmitz},
  journal={ArXiv},
  year={2020},
  volume={abs/1909.01958}
}
AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even Jeopardy!, but the rich variety of standardized exams has remained a landmark challenge. Even as recently as 2016, the best AI system could achieve merely 59.3 percent on an 8th grade science exam. This article reports success on the Grade 8 New York Regents Science Exam, where for the first time a system scores more than 90 percent on the exam’s nondiagram, multiple choice (NDMC) questions. In addition… 
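
The headline numbers here are plain accuracy over the exam's multiple-choice questions. As a point of reference, below is a minimal sketch of that scoring in Python, using hypothetical prediction and answer-key lists (the names and data are illustrative, not from the Aristo codebase):

    # Minimal sketch: accuracy scoring for non-diagram multiple-choice (NDMC)
    # questions. All data here is hypothetical, not from the actual exam.
    def ndmc_accuracy(predictions: list[str], gold: list[str]) -> float:
        """Return the fraction of questions whose predicted choice matches the key."""
        if len(predictions) != len(gold):
            raise ValueError("predictions and gold keys must align one-to-one")
        correct = sum(p == g for p, g in zip(predictions, gold))
        return correct / len(gold)

    # Example: 3 of 4 correct -> 75.0 percent; a score above 90 percent means
    # the system answers more than 9 in 10 such questions correctly.
    preds = ["B", "C", "A", "D"]
    keys = ["B", "C", "A", "A"]
    print(f"{ndmc_accuracy(preds, keys) * 100:.1f} percent")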

Citations

Humans Keep It One Hundred: an Overview of AI Journey
TLDR
The results of AI Journey, a competition of AI systems aimed at improving AI performance on knowledge bases, reasoning, and text generation, are described, showing different approaches to task understanding and reasoning.
Project Aristo: Towards Machines that Capture and Reason with Science Knowledge
TLDR
This talk will describe the journey of Aristo through various knowledge capture technologies, including acquiring if/then rules, tables, knowledge graphs, and latent neural representations, and speculate on the larger quest towards knowledgeable machines that can reason, explain, and discuss.
Challenge Closed-book Science Exam: A Meta-learning Based Question Answering System
TLDR
This work proposes a MetaQA framework in which system 1 is an intuitive meta-classifier and system 2 is a reasoning module; the framework efficiently solves science problems by learning from related example questions without relying on external knowledge bases.
AI Journey 2019: School Tests Solving Competition
TLDR
Describes a shared competition task with both automatic evaluation of questions and human assessment of computer-generated essays as part of the exam tasks, along with the data format and the complete evaluation pipeline required for such competitions.
What Does My QA Model Know? Devising Controlled Probes Using Expert Knowledge
TLDR
Presents a methodology for automatically building probe datasets from expert knowledge sources, allowing for systematic control and comprehensive evaluation, and confirms that transformer-based multiple-choice QA models are already predisposed to recognize certain types of structural linguistic knowledge.
ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention
TLDR
This paper taps the potential of transformer language models and bottom-up and top-down attention to tackle the language and visual understanding challenges that textbook question answering entails, relying on pre-trained transformers, fine-tuning, and ensembling.
Finding Old Answers to New Math Questions: The ARQMath Lab at CLEF 2020
TLDR
The ARQMath Lab at CLEF 2020 considers the problem of finding answers to new mathematical questions among posted answers on a community question answering site (Math Stack Exchange), and creates a standard test collection for researchers to use for benchmarking.
Connecting the Dots: A Knowledgeable Path Generator for Commonsense Question Answering
TLDR
This paper augments a general commonsense QA framework with a knowledgeable path generator: by extrapolating over existing paths in a knowledge graph with a state-of-the-art language model, the generator learns to connect a pair of entities in text with a dynamic, and potentially novel, multi-hop relational path.
Autoregressive Reasoning over Chains of Facts with Transformers
TLDR
This paper proposes an iterative inference algorithm for multi-hop explanation regeneration that retrieves relevant factual evidence in the form of text snippets, given a natural language question and its answer, and that outperforms the previous state of the art in precision, training time, and inference efficiency.
ParsiNLU: A Suite of Language Understanding Challenges for Persian
TLDR
This work introduces ParsiNLU, the first benchmark for the Persian language that includes a range of language understanding tasks (reading comprehension, textual entailment, and so on), and presents the first results of state-of-the-art monolingual and multilingual pre-trained language models on this benchmark, comparing them with human performance.
...

References

Showing 1-10 of 67 references
My Computer Is an Honor Student - but How Intelligent Is It? Standardized Tests as a Measure of AI
TLDR
It is argued that machine performance on standardized tests should be a key component of any new measure of AI, because attaining a high level of performance requires solving significant AI problems involving language understanding and world modeling, critical skills for any machine that lays claim to intelligence.
Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions
TLDR
This paper describes an alternative approach that operates at three levels of representation and reasoning: information retrieval, corpus statistics, and simple inference over a semi-automatically constructed knowledge base, to achieve substantially improved results.
Project Halo Update - Progress Toward Digital Aristotle
TLDR
The design and evaluation results for a system called AURA are presented, which enables domain experts in physics, chemistry, and biology to author a knowledge base and then allows a different set of users to ask novel questions against that knowledge base.
Three open problems in AI
TLDR
Once a computer can read one book and prove that it understands it by answering questions about it correctly, then, in principle, it can read all the books that have ever been written.
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
TLDR
A new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI.
RACE: Large-scale ReAding Comprehension Dataset From Examinations
TLDR
The proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of state-of-the-art models and ceiling human performance.
Project Halo: Towards a Digital Aristotle
TLDR
The motivation and long-term goals of Project Halo are presented, and the six-month first phase of the project, the Halo Pilot, is described: its KR&R challenge, empirical evaluation, results, and failure analysis.
The Most Uncreative Examinee: A First Step toward Wide Coverage Natural Language Math Problem Solving
TLDR
A prototype system that accepts as input a linguistically annotated problem text was developed; evaluation on entrance exam mock tests revealed that an optimistic estimate of the system's performance already matches human averages on a few test sets.
Building Watson: An Overview of the DeepQA Project
TLDR
The results strongly suggest that DeepQA is an effective and extensible architecture that may be used as a foundation for combining, deploying, evaluating and advancing a wide range of algorithmic techniques to rapidly advance the field of QA.
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
TLDR
A new kind of question answering dataset, OpenBookQA, modeled after open book exams for assessing human understanding of a subject, and oracle experiments designed to circumvent the knowledge retrieval bottleneck demonstrate the value of both the open book and additional facts.
...