Tokenization Consistency Matters for Generative Models on Extractive NLP Tasks

Kaiser Sun, Peng Qi, Yuhao Zhang, Lan Liu, William Yang Wang, Zhiheng Huang
Generative models have been widely applied to solve extractive tasks, where parts of the input are extracted to form the desired output, and have achieved significant success. For example, in extractive question answering (QA), generative models have consistently yielded state-of-the-art results. In this work, we identify the issue of tokenization inconsistency that is commonly neglected in training these models. This issue damages the extractive nature of these tasks after the input and output are…
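The inconsistency the abstract refers to arises because subword tokenizers are context sensitive: an answer span tokenized on its own can yield different subwords than the same span inside the passage. A minimal sketch below illustrates this with a hypothetical vocabulary and a toy greedy longest-match tokenizer (the "Ġ" symbol marks a preceding space, following the GPT-2/BPE convention); the vocabulary and tokenizer are illustrative assumptions, not the paper's actual setup.

```python
# Toy greedy longest-match subword tokenizer over a hypothetical vocabulary.
# It shows that the answer span "answer" gets different subwords when
# tokenized inside the passage (space-prefixed) vs. as a standalone target.
VOCAB = {"Ġanswer", "Ġis", "an", "swer", "t", "h", "e"}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match tokenization; spaces become the 'Ġ' marker."""
    text = text.replace(" ", "Ġ")
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens

print(tokenize("the answer is"))  # in-context: ['t', 'h', 'e', 'Ġanswer', 'Ġis']
print(tokenize("answer"))         # standalone target: ['an', 'swer']
```

Because the standalone target tokens ['an', 'swer'] never appear in the tokenized input, a generative model trained on such targets can no longer learn to simply copy input tokens, which is what breaks the extractive nature of the task.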

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

UnifiedQA: Crossing Format Boundaries With a Single QA System

This work uses the latest advances in language modeling to build a single pre-trained QA model, UNIFIEDQA, that performs well across 19 QA datasets spanning 4 diverse formats, and results in a new state of the art on 10 factoid and commonsense question answering datasets.

Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

Interestingly, it is observed that the performance of this method significantly improves when increasing the number of retrieved passages, evidence that sequence-to-sequence models offer a flexible framework to efficiently aggregate and combine evidence from multiple passages.

Byte Pair Encoding is Suboptimal for Language Model Pretraining

Differences between BPE and unigram LM tokenization are analyzed, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE’s greedy construction procedure.
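The greedy construction procedure mentioned above can be sketched in a few lines: BPE builds its vocabulary by repeatedly merging the single most frequent adjacent symbol pair, with no global objective. The toy corpus below (words as space-separated character sequences with frequencies) is an illustrative assumption, not data from the paper.

```python
# Minimal sketch of BPE's greedy vocabulary construction on a toy corpus:
# at each step, the most frequent adjacent symbol pair is merged.
from collections import Counter

def bpe_merges(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """words maps a space-separated symbol sequence to its corpus frequency."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            symbols = word.split()
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # greedy: most frequent pair wins
        merges.append(best)
        pattern = " ".join(best)
        words = {w.replace(pattern, "".join(best)): f for w, f in words.items()}
    return merges

corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
print(bpe_merges(corpus, 3))  # [('e', 's'), ('es', 't'), ('l', 'o')]
```

Each merge is locally optimal but irrevocable, which is exactly why the resulting subwords can diverge from morphological boundaries, the failure mode the unigram LM comparison highlights.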

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

Improving the Numerical Reasoning Skills of Pretrained Language Models

This paper proposes a new extended pretraining approach called reasoning-aware pretraining to jointly address both shortcomings without requiring architectural changes or pretraining from scratch.

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross sentence reasoning to find answers.

CLASP: Few-Shot Cross-Lingual Data Augmentation for Semantic Parsing

This work proposes CLASP, a simple method to improve low-resource SP for moderate-sized models: synthetic data from AlexaTM 20B is generated to augment the training set for a model 40x smaller (500M parameters) and shows significant improvements over strong baseline methods.

MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension

In this shared task, 18 distinct question answering datasets were adapted and unified into the same format; the best system achieved an average F1 score of 72.5 on the 12 held-out datasets.

DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension

DuoRC, a novel dataset for Reading Comprehension (RC), is proposed; it motivates several new challenges for neural approaches to language understanding beyond those offered by existing RC datasets, and could complement them in exploring novel neural approaches to studying language understanding.