ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

@article{Xue2022ByT5TA,
  title={ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models},
  author={Linting Xue and Aditya Barua and Noah Constant and Rami Al-Rfou and Sharan Narang and Mihir Kale and Adam Roberts and Colin Raffel},
  journal={Transactions of the Association for Computational Linguistics},
  year={2022},
  volume={10},
  pages={291-306}
}
Most widely used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: They can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Because byte or character sequences are longer than token sequences, past work on token… 
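To make the byte-level interface concrete, here is a minimal Python sketch (not the released ByT5 code) of how raw text can become model inputs: the UTF-8 bytes of the string, shifted past a few reserved special ids. The offset of 3 and the pad/eos/unk ids follow the released ByT5 vocabulary as an assumption; verify them against the actual tokenizer before relying on them.

```python
# Minimal sketch of byte-level text encoding in the spirit of ByT5.
# Not the official implementation; the special-token ids are assumptions.
SPECIAL = {"<pad>": 0, "</s>": 1, "<unk>": 2}
OFFSET = len(SPECIAL)  # byte value b maps to id b + OFFSET

def encode(text: str) -> list[int]:
    """Any Unicode string becomes a sequence of byte ids: no vocabulary,
    no language-specific preprocessing."""
    return [b + OFFSET for b in text.encode("utf-8")] + [SPECIAL["</s>"]]

def decode(ids: list[int]) -> str:
    """Invert encode(); special ids are dropped."""
    raw = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return raw.decode("utf-8", errors="ignore")

print(encode("héllo"))           # 'é' is two UTF-8 bytes, so 6 byte ids + </s>
print(decode(encode("héllo")))   # héllo
```

The trade-off the abstract notes is visible even here: the byte sequence is several times longer than a typical subword tokenization of the same text.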
Citations

Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information
TLDR
It is shown that incorporating XRAYEMB’s learned vectors into sequences of pre-trained token embeddings helps performance on both autoregressive and masked pre-trained transformer architectures and on both sequence-level and sequence tagging tasks, particularly on nonstandard English text.
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
TLDR
This survey connects several lines of work from the pre-neural and neural era, by showing how hybrid approaches of words and characters as well as subword-based approaches based on learned segmentation have been proposed and evaluated.
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
TLDR
This paper introduces a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion and paves the way for highly performant token-free models that are trained completely end-to-end.
Sub-Character Tokenization for Chinese Pretrained Language Models
TLDR
Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) They can tokenize inputs into much shorter sequences, thus improving the computational efficiency, and 2) Pronunciation-based SubChar tokenization can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to all homophone typos.
AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing
TLDR
This comprehensive survey paper explains core concepts like pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods, presents a new taxonomy of T-PTLMs, and gives a brief overview of various benchmarks, both intrinsic and extrinsic.
Evaluating Various Tokenizers for Arabic Text Classification
TLDR
This paper introduces three new tokenization algorithms for Arabic and compares them to three other baselines using unsupervised evaluations and shows that the performance of such tokenization algorithms depends on the size of the dataset, type of the task, and the amount of morphology that exists in the dataset.
Revisiting the Primacy of English in Zero-shot Cross-lingual Transfer
TLDR
English is compared against other transfer languages for fine-tuning, showing that other high-resource languages such as German and Russian often transfer more effectively, especially when the set of target languages is diverse or unknown a priori.
Demystifying Neural Language Models' Insensitivity to Word-Order
TLDR
The insensitivity of natural language models to word order is investigated by quantifying perturbations and analysing their effect on neural models’ performance on language understanding tasks in the GLUE benchmark, and it is found that neural language models — pretrained and non-pretrained Transformers, LSTMs, and convolutional architectures — require local ordering more than the global ordering of tokens.
Correcting diacritics and typos with ByT5 transformer model
TLDR
This work tackles diacritics restoration and typos correction at once by employing the newly-developed universal ByT5 byte-level seq2seq transformer model that requires no language-specific model structures and strongly outperforms classical spell-checking or dictionary-based approaches.
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers
TLDR
This paper presents the fundamentals behind the next version of the Perspective API from Google Jigsaw, and presents a single multilingual token-free Charformer model that is applicable across a range of languages, domains, and tasks.
...

References

SHOWING 1-10 OF 75 REFERENCES
Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
TLDR
Canine is presented, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.
Multilingual Language Processing From Bytes
TLDR
An LSTM-based model is described that reads text as bytes and outputs span annotations of the form [start, length, label], where start positions, lengths, and labels are separate entries in the vocabulary.
Neural Machine Translation with Byte-Level Subwords
TLDR
This paper investigates byte-level subwords, specifically byte-level BPE (BBPE), which is more compact than a character vocabulary, has no out-of-vocabulary tokens, and is more efficient than using pure bytes only; it claims that contextualizing BBPE embeddings is necessary, which can be implemented with a convolutional or recurrent layer.
Revisiting Character-Based Neural Machine Translation with Capacity and Compression
TLDR
It is shown that the modeling problem can be solved by standard sequence-to-sequence architectures of sufficient depth, and that deep models operating at the character level outperform identical models operating over word fragments, implying that alternative architectures for handling character input are better viewed as methods for reducing computation time than as improved ways of modeling longer sequences.
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing
TLDR
SentencePiece is a language-independent subword tokenizer and detokenizer designed for neural text processing; validation experiments find that it is possible to achieve comparable accuracy with direct subword training from raw sentences.
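As a usage illustration (not drawn from the paper itself), the snippet below trains and applies a SentencePiece model with the sentencepiece Python package. The corpus path, model prefix, and vocabulary size are placeholder values, and the keyword-argument API assumes a recent release of the package.

```python
# Hypothetical end-to-end SentencePiece example; file names and settings are
# placeholders, and 'corpus.txt' is assumed to hold one raw sentence per line.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # raw, untokenized text
    model_prefix="spm_demo",   # writes spm_demo.model and spm_demo.vocab
    vocab_size=8000,
    model_type="unigram",      # "bpe", "char", and "word" are also supported
)

sp = spm.SentencePieceProcessor(model_file="spm_demo.model")
print(sp.encode("Hello, world!", out_type=str))  # subword pieces
print(sp.encode("Hello, world!", out_type=int))  # integer ids
print(sp.decode(sp.encode("Hello, world!")))     # round-trips the raw text
```

Because the model is trained directly on raw sentences, the same two calls handle both tokenization and detokenization without any language-specific preprocessing.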
mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer
The recent “Text-to-Text Transfer Transformer” (T5) leveraged a unified text-to-text format and scale to attain state-of-the-art results on a wide variety of English-language NLP tasks. In this…
Bridging the Gap for Tokenizer-Free Language Models
TLDR
This paper trains a vanilla transformer network with 40 self-attention layers on the One Billion Word (lm1b) benchmark and achieves a new state of the art for tokenizer-free LMs, pushing these models to be on par with their word-based counterparts.
Training Multilingual Pre-trained Language Model with Byte-level Subwords
TLDR
In the experiment, the architecture of NEZHA was adopted as the underlying pre-trained language model and the results show that NEZHA trained with byte-level subwords consistently outperforms Google multilingual BERT and vanilla NEZHA by a notable margin in several multilingual NLU tasks.
Neural Machine Translation of Rare Words with Subword Units
TLDR
This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
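The procedure behind those subword units is compact enough to sketch. Below is an illustrative Python version of the BPE merge loop the paper describes: count adjacent symbol pairs across a word-frequency dictionary, merge the most frequent pair, and repeat. The toy dictionary and the number of merges are made up for the example.

```python
# Illustrative BPE merge learning over a toy word-frequency dictionary.
# Words are space-separated symbol sequences ending in an end-of-word marker.
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                 # 10 merges; real vocabularies use thousands
    pairs = pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge(best, vocab)
    print(best)                     # each printed pair becomes a new symbol
print(vocab)
```

Rare or unseen words are then segmented by applying the learned merges in order, so an open vocabulary is covered by a fixed symbol inventory.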
Fully Character-Level Neural Machine Translation without Explicit Segmentation
TLDR
A neural machine translation model that maps a source character sequence to a target character sequence without any segmentation is introduced, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities.
...