• Corpus ID: 245335281

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

Sabrina J. Mielke, Zaid Alyafeai, Elizabeth Salesky, Colin Raffel, Manan Dey, Matthias Gallé, Arun Raja, Chenglei Si, Wilson Y. Lee, Benoît Sagot, Samson Tan
What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating those as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, enabling small vocabularies while still allowing for fast inference. Is the end of the road character-level model or… 

On the Effectiveness of Quasi Character-Level Models for Machine Translation

This work suggests that quasi-character-level models have practically the same generalization capabilities as character-based models but at lower computational costs and appear to help achieve greater consistency between domains than standard subword-level models, although the catastrophic forgetting problem is not mitigated.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

BLOOM is a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers and achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning.

What do tokens know about their characters and how do they know it?

The mechanisms through which PLMs acquire English-language character information during training are investigated and it is argued that this knowledge is acquired through multiple phenomena, including a systematic relationship between particular characters and particular parts of speech, as well as natural variability in the tokenization of related strings.

Language Modelling with Pixels

PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels, and is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.

Improving Tokenisation by Alternative Treatment of Spaces

An alternative tokenisation approach where spaces are treated as individual tokens is evaluated in experiments, which show that the modified algorithms give improved performance on downstream NLP tasks that involve handling complex words, whilst having no detrimental effect on performance in general natural language understanding tasks.

Beyond Characters: Subword-level Morpheme Segmentation

This paper presents DeepSPIN’s submissions to the SIGMORPHON 2022 Shared Task on Morpheme Segmentation, and challenges the assumption that models for morphological tasks should be trained at the character level by building a transformer that generates morphemes as sequences of unigram language model-induced subwords.

The SIGMORPHON 2022 Shared Task on Morpheme Segmentation

The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and

Text Generation with Text-Editing Models

This tutorial provides a comprehensive overview of text-editing models and current state-of-the-art approaches, analyzing their pros and cons.

Advancing protein language models with linguistics: a roadmap for improved interpretability

It is argued that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid with building more interpretable protein LMs that have learned relevant domain-specific rules.

Word-order Typology in Multilingual BERT: A Case Study in Subordinate-Clause Detection

The capabilities and limitations of BERT and similar models are still unclear when it comes to learning syntactic abstractions, in particular across languages. In this paper, we use the task of



ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

This paper shows that a standard Transformer architecture can be used with minimal modifications to process byte sequences, characterizes the trade-offs in terms of parameter count, training FLOPs, and inference speed, and shows that byte-level models are competitive with their token-level counterparts.

An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks

Experimental results demonstrate that a hybrid approach of morphological segmentation followed by BPE works best in Korean to/from English machine translation and natural language understanding tasks such as KorNLI, KorSTS, NSMC, and PAWS-X.

A Statistical Extension of Byte-Pair Encoding

Experimental results with morphologically rich languages show that the proposed model achieves nearly optimal BLEU scores and produces morphologically better word segmentations, which allows it to outperform BPE’s generalization in the translation of sentences containing new words, as shown via human evaluation.

Spell Once, Summon Anywhere: A Two-Level Open-Vocabulary Language Model

It is shown how the spellings of known words can help deal with unknown words in open-vocabulary NLP tasks, beating previous work and establishing state-of-the-art results on multiple datasets.

Comparing neural‐ and N‐gram‐based language models for word segmentation

This article proposes an approach based on a beam search algorithm and a language model working at the byte/character level, the latter component implemented either as an n‐gram model or a recurrent neural network.

Target-side Word Segmentation Strategies for Neural Machine Translation

It is demonstrated that linguistically informed target word segmentation is better suited for NMT, leading to improved translation quality on the order of magnitude of +0.5 BLEU and −0.9 TER for a medium-scale English→German translation task.

Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Canine is presented, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.

Morphological and Language-Agnostic Word Segmentation for NMT

A critical difference between BPE and STE is identified and a simple pre-processing step for BPE is shown that considerably increases translation quality as evaluated by automatic measures.

Text segmentation with character-level text embeddings

This work proposes to learn text representations directly from raw character sequences by training a Simple Recurrent Network to predict the next character in text and uses the learned text embeddings as features in a supervised character level text segmentation and labeling task.

Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The Case of Fon Language

This paper revisits standard tokenization methods and introduces Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy to create a better representative vocabulary for training.