Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks
@article{Sun2022TokenizationCM,
  title   = {Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks},
  author  = {Kaiser Sun and Peng Qi and Yuhao Zhang and Lan Liu and William Yang Wang and Zhiheng Huang},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2212.09912}
}
Generative models have been widely applied to solve extractive tasks, where parts of the input are extracted to form the desired output, and have achieved significant success. For example, in extractive question answering (QA), generative models have consistently yielded state-of-the-art results. In this work, we identify the issue of tokenization inconsistency that is commonly neglected in training these models. This issue damages the extractive nature of these tasks after the input and output are…
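The inconsistency the abstract describes is easiest to see with a concrete tokenizer. The sketch below is a minimal illustration, assuming the Hugging Face transformers package and the public t5-small checkpoint (both are illustrative choices, not taken from this page); it compares how an extractive answer is tokenized on its own versus as part of its passage.

```python
# Minimal sketch, assuming the Hugging Face `transformers` package and the
# public "t5-small" checkpoint (illustrative assumptions, not this paper's code).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

context = "The bridge opened in 1883, after 14 years of construction."
answer = "1883"  # an extractive answer that appears verbatim in the context

# How the target side of a seq2seq model is usually built:
# tokenize the answer string in isolation.
target_tokens = tokenizer.tokenize(answer)

# How the encoder actually sees those same characters:
# as part of the tokenized context, where surrounding text can change the segmentation.
context_tokens = tokenizer.tokenize(context)

print("answer alone :", target_tokens)
print("full context :", context_tokens)
# If the segmentation of "1883" differs between the two (for example one
# subword versus two), the decoder is trained to emit a token sequence that
# never occurs in its input, which is the tokenization inconsistency the
# abstract refers to.
```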