From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project

@article{Clark2020FromT,
  title={From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project},
  author={Peter Clark and Oren Etzioni and Daniel Khashabi and Tushar Khot and Bhavana Dalvi and Kyle Richardson and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord and Niket Tandon and Sumithra Bhakthavatsalam and Dirk Groeneveld and Michal Guerquin and Michael Schmitz},
  journal={ArXiv},
  year={2020},
  volume={abs/1909.01958}
}
AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even Jeopardy!, but the rich variety of standardized exams has remained a landmark challenge. Even as recently as 2016, the best AI system could achieve merely 59.3 percent on an 8th grade science exam. This article reports success on the Grade 8 New York Regents Science Exam, where for the first time a system scores more than 90 percent on the exam's non-diagram, multiple-choice (NDMC) questions. In addition…

Figures and Tables from this paper

Humans Keep It One Hundred: An Overview of AI Journey
TLDR: The results of AI Journey, a competition of AI systems aimed at improving AI performance on knowledge bases, reasoning, and text generation, are described, showing different approaches to task understanding and reasoning.
Project Aristo: Towards Machines that Capture and Reason with Science Knowledge
TLDR: This talk describes the journey of Aristo through various knowledge-capture technologies, including acquiring if/then rules, tables, knowledge graphs, and latent neural representations, and speculates on the larger quest towards knowledgeable machines that can reason, explain, and discuss.
Challenge Closed-book Science Exam: A Meta-learning Based Question Answering System
TLDR: This work proposes a MetaQA framework, in which system 1 is an intuitive meta-classifier and system 2 is a reasoning module, that can efficiently solve science problems by learning from related example questions without relying on external knowledge bases.
AI Journey 2019: School Tests Solving Competition
Question answering is a popular, complex task in machine learning. One challenging variant is developing systems capable of passing educational exam tests. Such tests induce an…
ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention
TLDR: This paper taps the potential of transformer language models and bottom-up and top-down attention to tackle the language and visual understanding challenges this task entails, relying on pre-trained transformers, fine-tuning, and ensembling.
Finding Old Answers to New Math Questions: The ARQMath Lab at CLEF 2020
TLDR: The ARQMath Lab at CLEF 2020 considers the problem of finding answers to new mathematical questions among posted answers on a community question answering site (Math Stack Exchange), and creates a standard test collection for researchers to use for benchmarking.
Autoregressive Reasoning over Chains of Facts with Transformers
TLDR: This paper proposes an iterative inference algorithm for multi-hop explanation regeneration that retrieves relevant factual evidence in the form of text snippets, given a natural language question and its answer, and outperforms the previous state of the art in precision, training time, and inference efficiency.
A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering
TLDR: To investigate the performance of natural language understanding approaches on statutory reasoning, a dataset is introduced together with a legal-domain text corpus; straightforward application of machine reading models exhibits low out-of-the-box performance on these questions, whether or not the models have been fine-tuned to the legal domain.
Pretrain Knowledge-Aware Language Models
How much knowledge do pretrained language models hold? Recent research observed that pretrained transformers are adept at modeling semantics, but it is unclear to what degree they grasp human…
Statutory Legal Reasoning: Challenging Natural Language Systems with Understanding Prescriptive Rules (2020)
Legislation can be viewed as a body of prescriptive rules expressed in natural language. We refer to the application of legislation to the facts of a case as statutory reasoning, where those facts are…

References

Showing 1–10 of 69 references
My Computer Is an Honor Student - but How Intelligent Is It? Standardized Tests as a Measure of AI
TLDR: It is argued that machine performance on standardized tests should be a key component of any new measure of AI, because attaining a high level of performance requires solving significant AI problems involving language understanding and world modeling, critical skills for any machine that lays claim to intelligence.
Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions
TLDR: This paper evaluates the methods on six years of unseen, unedited exam questions from the NY Regents Science Exam, and shows that the overall system's score is 71.3%, an improvement of 23.8% (absolute) over the MLN-based method described in previous work.
Project Halo Update - Progress Toward Digital Aristotle
TLDR: The design and evaluation results are presented for a system called AURA, which enables domain experts in physics, chemistry, and biology to author a knowledge base, and then allows a different set of users to ask novel questions against that knowledge base.
Three open problems in AI
TLDR: Once a computer can read one book and prove that it understands it by answering questions about it correctly, then, in principle, it can read all the books that have ever been written.
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
TLDR: A new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI.
RACE: Large-scale ReAding Comprehension Dataset From Examinations
TLDR: The proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of state-of-the-art models and ceiling human performance.
Project Halo: Towards a Digital Aristotle
TLDR: The motivation and long-term goals of Project Halo are presented, and the six-month first phase of the project, the Halo Pilot, is described: its KR&R challenge, empirical evaluation, results, and failure analysis.
The Most Uncreative Examinee: A First Step toward Wide Coverage Natural Language Math Problem Solving
TLDR: A prototype system that accepts as input a linguistically annotated problem text was developed; evaluation on entrance-exam mock tests revealed that an optimistic estimate of the system's performance already matches human averages on a few test sets.
Building Watson: An Overview of the DeepQA Project
TLDR: The results strongly suggest that DeepQA is an effective and extensible architecture that may be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of QA.
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
TLDR: A new kind of question answering dataset, OpenBookQA, modeled after open-book exams for assessing human understanding of a subject, is presented; oracle experiments designed to circumvent the knowledge-retrieval bottleneck demonstrate the value of both the open book and the additional facts.