Injecting Numerical Reasoning Skills into Language Models
This work shows that numerical reasoning is amenable to automatic data generation, so the skill can be injected into pre-trained LMs by generating large amounts of numerical data and training in a multi-task setup.
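As a rough illustration of what such automatic data generation can look like, the sketch below samples numbers into question templates to produce synthetic numerical-reasoning examples; the templates are hypothetical, not the paper's actual generator, which covers many more operations.

```python
import random

# Hypothetical question templates paired with the functions that solve them.
TEMPLATES = [
    ("What is {a} plus {b}?", lambda a, b: a + b),
    ("What is {a} minus {b}?", lambda a, b: a - b),
    ("What is the larger of {a} and {b}?", lambda a, b: max(a, b)),
]

def generate_example(rng):
    """Sample a template and two numbers; return a question-answer pair."""
    template, solve = rng.choice(TEMPLATES)
    a, b = rng.randint(0, 10_000), rng.randint(0, 10_000)
    return {"question": template.format(a=a, b=b), "answer": str(solve(a, b))}

rng = random.Random(0)
for _ in range(3):
    print(generate_example(rng))
```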
Break It Down: A Question Understanding Benchmark
This work introduces a Question Decomposition Meaning Representation (QDMR) for questions and demonstrates its utility in two ways: QDMR can be used to improve open-domain question answering on the HotpotQA dataset, and it can be deterministically converted to a pseudo-SQL formal language, which can alleviate the annotation burden in semantic parsing applications.
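A hand-written example, not drawn from the paper's dataset, of the kind of step-by-step decomposition QDMR encodes:

```python
# "#1", "#2" are references to the outputs of earlier steps, which is what
# makes a deterministic mapping to a pseudo-SQL-like formal language possible.
question = "What is the longest river in Texas?"
qdmr_steps = [
    "return rivers",
    "return #1 in Texas",
    "return lengths of #2",
    "return #2 where #3 is highest",
]
for i, step in enumerate(qdmr_steps, start=1):
    print(f"{i}. {step}")
```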
Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
It is shown that model performance improves when annotator identifiers are included as features during training, that models learn to recognize the most productive annotators, and that models often fail to generalize to examples from annotators who did not contribute to the training set.
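A minimal sketch of the diagnostic described above, assuming a hypothetical `examples` list and marker format: the annotator identifier is surfaced as part of the model input so the model can condition on it as a feature.

```python
# Hypothetical labeled examples carrying the ID of the annotator who wrote them.
examples = [
    {"annotator_id": "A17", "text": "The cat sat on the mat.", "label": 1},
    {"annotator_id": "B02", "text": "It is raining in Paris.", "label": 0},
]

def with_annotator_feature(example):
    # Prepend the annotator ID, e.g. "[ANNOTATOR=A17] The cat sat on the mat."
    return f"[ANNOTATOR={example['annotator_id']}] {example['text']}"

for ex in examples:
    print(with_annotator_feature(ex))
```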
Transformer Feed-Forward Layers Are Key-Value Memories
This work shows that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary.
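The key-value view can be stated in a few lines: treating the first feed-forward weight matrix as keys and the second as values, the layer computes a pattern-match score per key and returns the corresponding weighted sum of values. A minimal NumPy sketch with random weights, illustrative only:

```python
import numpy as np

# FFN(x) = f(x @ K.T) @ V: each row of K is a "key" whose activation measures
# how strongly the input matches a learned pattern, and each row of V is a
# "value" vector that shifts the model's output distribution.
d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
K = rng.normal(size=(d_ff, d_model))  # keys (first FFN weight matrix)
V = rng.normal(size=(d_ff, d_model))  # values (second FFN weight matrix)

def ffn(x):
    memory_coefficients = np.maximum(x @ K.T, 0.0)  # ReLU pattern-match scores
    return memory_coefficients @ V                  # weighted sum of values

x = rng.normal(size=(d_model,))
print(ffn(x).shape)  # (8,) — same dimensionality as the input
```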
Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies
- Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, D. Roth, Jonathan Berant
- Transactions of the Association for Computational Linguistics
- 6 January 2021
This work introduces StrategyQA, a question answering benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy, and proposes a data collection procedure that combines term-based priming to inspire annotators, careful control over the annotator population, and adversarial filtering for eliminating reasoning shortcuts.
DiscoFuse: A Large-Scale Dataset for Discourse-Based Sentence Fusion
This work proposes a method for automatically generating fusion examples from raw text, uses it to build DiscoFuse, a large-scale dataset for discourse-based sentence fusion, and shows that a sequence-to-sequence model trained on DiscoFuse improves performance on WebSplit when viewed as a sentence fusion task.
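A toy illustration of rule-based fusion-example generation, far simpler than the paper's rules: split a fused sentence at a discourse connective to obtain the model input, keeping the original sentence as the fusion target.

```python
import re

# Toy rule: a naturally occurring fused sentence is split at a connective;
# the split pair becomes the input and the original becomes the target.
def make_fusion_example(fused):
    match = re.match(r"(.+?), (but|and|so) (.+?)\.?$", fused)
    if match is None:
        return None  # no connective found; skip this sentence
    first, _, second = match.groups()
    second = second[0].upper() + second[1:]
    return f"{first}. {second}.", fused

print(make_fusion_example("The show was renewed, but it was cancelled a year later."))
```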
Emergence of Communication in an Interactive World with Consistent Speakers
A new model and training algorithm are proposed that utilize the structure of a learned representation space to produce more consistent speakers in the initial phases of training, which stabilizes learning and increases context-independence compared to policy gradient and other competitive baselines.
SCROLLS: Standardized CompaRison Over Long Language Sequences
This work introduces SCROLLS, a suite of tasks that require reasoning over long texts; it examines existing long-text datasets and handpicks ones where the text is naturally long, prioritizing tasks that involve synthesizing information across the input.
Break, Perturb, Build: Automatic Perturbation of Reasoning Paths Through Question Decomposition
This work introduces the “Break, Perturb, Build” (BPB) framework for automatic reasoning-oriented perturbation of question-answer pairs, and demonstrates the effectiveness of BPB by creating evaluation sets for three reading comprehension benchmarks, generating thousands of high-quality examples without human intervention.
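A schematic sketch of one BPB-style perturbation, using hypothetical replacement rules rather than the paper's actual operators: edit a decomposition step (e.g., flipping a superlative) so the recomposed question has a predictably different answer.

```python
# Hypothetical replacement rules standing in for the paper's perturbation
# operators over decomposition steps.
PERTURBATIONS = {"highest": "lowest", "longer": "shorter", "before": "after"}

def perturb_steps(steps):
    """Apply the first matching rule to each decomposition step."""
    perturbed = []
    for step in steps:
        for old, new in PERTURBATIONS.items():
            if old in step:
                step = step.replace(old, new)
                break
        perturbed.append(step)
    return perturbed

steps = ["return field goals", "return lengths of #1", "return #1 where #2 is highest"]
print(perturb_steps(steps))
# -> ['return field goals', 'return lengths of #1', 'return #1 where #2 is lowest']
```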
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Evaluating OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters, shows that model performance and calibration both improve with scale but remain poor in absolute terms.