Corpus ID: 235624202

Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

@article{Tay2021CharformerFC,
  title={Charformer: Fast Character Transformers via Gradient-based Subword Tokenization},
  author={Yi Tay and Vinh Quang Tran and Sebastian Ruder and Jai Gupta and Hyung Won Chung and Dara Bahri and Zhen Qin and Simon Baumgartner and Cong Yu and Donald Metzler},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.12672}
}
State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely… 
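The GBST module summarized in the abstract can be pictured as forming candidate subword blocks of several sizes at every character position, scoring those candidates, softly mixing them, and then downsampling to a shorter sequence. The NumPy sketch below illustrates that idea only; the fixed block sizes, the single linear scoring vector, mean pooling, and the downsample rate are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the soft block-scoring idea behind GBST (illustrative,
# not the Charformer authors' code). Assumed details: mean pooling for block
# candidates, a single learned scoring vector, and fixed block sizes.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gbst_sketch(char_embs, score_w, block_sizes=(1, 2, 3, 4), downsample_rate=2):
    """char_embs: (seq_len, dim) character embeddings.
    score_w: (dim,) assumed learned scoring vector.
    Returns latent "subword" representations of shape (ceil(seq_len / downsample_rate), dim)."""
    seq_len, dim = char_embs.shape
    candidates = []  # one candidate representation per block size, aligned to characters
    for b in block_sizes:
        # Mean-pool non-overlapping blocks of size b, then repeat each pooled
        # vector b times so every character position has a size-b candidate.
        pad = (-seq_len) % b
        x = np.pad(char_embs, ((0, pad), (0, 0)))
        pooled = x.reshape(-1, b, dim).mean(axis=1)          # (num_blocks, dim)
        upsampled = np.repeat(pooled, b, axis=0)[:seq_len]   # (seq_len, dim)
        candidates.append(upsampled)
    cand = np.stack(candidates, axis=1)                      # (seq_len, num_sizes, dim)

    # Score each block size at each position and mix the candidates softly,
    # so the "tokenization" stays differentiable end to end.
    scores = softmax(cand @ score_w, axis=1)                 # (seq_len, num_sizes)
    mixed = (scores[..., None] * cand).sum(axis=1)           # (seq_len, dim)

    # Downsample to a shorter sequence, as a subword tokenizer would.
    pad = (-seq_len) % downsample_rate
    mixed = np.pad(mixed, ((0, pad), (0, 0)))
    return mixed.reshape(-1, downsample_rate, dim).mean(axis=1)

# Usage: 16 characters with 8-dim embeddings -> 8 latent subword vectors.
rng = np.random.default_rng(0)
out = gbst_sketch(rng.normal(size=(16, 8)), rng.normal(size=8))
print(out.shape)  # (8, 8)
```

Because the mixing weights come from a softmax rather than a hard choice of block size, gradients flow through the scoring step, which is what makes the segmentation learnable inside the model.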
Local Byte Fusion for Neural Machine Translation
TLDR
Extensive experiments on multilingual translation, zero-shot cross-lingual transfer and domain adaptation reveal a consistent improvement over traditional byte-based models and even over subword techniques.
A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning
TLDR
This work proposes a vocabulary-free neural tokenizer by distilling segmentation information from heuristic-based subword tokenization, which allows end-to-end task learning, resulting in optimal task-specific tokenization.
AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing
TLDR
This comprehensive survey paper explains core concepts such as pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods, presents a new taxonomy of T-PTLMs, and gives a brief overview of various benchmarks, both intrinsic and extrinsic.
Sub-Character Tokenization for Chinese Pretrained Language Models
TLDR
Experimental results show that SubChar tokenizers have two main advantages over existing tokenizers: 1) they can tokenize inputs into much shorter sequences, thus improving computational efficiency, and 2) pronunciation-based SubChar tokenization can encode Chinese homophones into the same transliteration sequences and produce the same tokenization output, hence being robust to homophone typos.
Lifting the Curse of Multilinguality by Pre-training Modular Transformers
TLDR
This work introduces language-specific modules in its Cross-lingual Modular models from the start of pre-training, which allows growing the total capacity of the model while keeping the number of trainable parameters per language constant.
Patching Leaks in the Charformer for Efficient Character-Level Generation
TLDR
The GBST method from Charformer, which groups (i.e., downsamples) characters, is used to enable character grouping in the decoder, and promising performance on English–Turkish translation indicates the potential of character-level models for morphologically rich languages.
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
TLDR
This survey connects several lines of work from the pre-neural and neural eras, showing how hybrid approaches combining words and characters, as well as subword-based approaches built on learned segmentation, have been proposed and evaluated.
Can Character-based Language Models Improve Downstream Task Performances In Low-Resource And Noisy Language Scenarios?
TLDR
This work shows that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models.
A New Generation of Perspective API: Efficient Multilingual Character-level Transformers
TLDR
This paper presents the fundamentals behind the next version of the Perspective API from Google Jigsaw, including a single multilingual token-free Charformer model that is applicable across a range of languages, domains, and tasks.
Hierarchical Transformers Are More Efficient Language Models
TLDR
Hourglass is a hierarchical Transformer language model that sets a new state of the art for Transformer models on the ImageNet32 generation task and improves language-modeling efficiency on the widely studied enwik8 benchmark.
…

References

Showing 1–10 of 69 references
CharBERT: Character-aware Pre-trained Language Model
TLDR
This paper proposes CharBERT, a character-aware pre-trained language model that improves on previous methods such as BERT and RoBERTa, and introduces a new pre-training task, NLM (Noisy LM), for unsupervised character representation learning.
Byte Pair Encoding is Suboptimal for Language Model Pretraining
TLDR
Differences between BPE and unigram LM tokenization are analyzed, finding that the latter method recovers subword units that align more closely with morphology and avoids problems stemming from BPE’s greedy construction procedure.
Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
TLDR
Canine is presented, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.
ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models
TLDR
This paper shows that a standard Transformer architecture can be used with minimal modifications to process byte sequences, characterizes the trade-offs in terms of parameter count, training FLOPs, and inference speed, and shows that byte-level models are competitive with their token-level counterparts.
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
TLDR
This work proposes Funnel-Transformer, a model that gradually compresses the sequence of hidden states into a shorter one, thereby reducing computation cost, and that outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
TLDR
This work proposes CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and instead uses a Character-CNN module to represent entire words by consulting their characters, and shows that this new model improves the performance of BERT on a variety of medical domain tasks while producing robust, word-level, open-vocabulary representations.
Big Bird: Transformers for Longer Sequences
TLDR
It is shown that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.
On the Cross-lingual Transferability of Monolingual Representations
TLDR
This work designs an alternative approach that transfers a monolingual model to new languages at the lexical level and shows that it is competitive with multilingual BERT on standard cross-lingual classification benchmarks and on a new Cross-lingual Question Answering Dataset (XQuAD).
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
TLDR
Empirical results on machine translation suggest that DPE is effective for segmenting output sentences and can be combined with BPE dropout for stochastic segmentation of source sentences.
…