Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality

Gustavo Aguilar, Bryan McCann, Tong Niu, Nazneen Rajani, Nitish Shirish Keskar, Thamar Solorio
Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization process of language models as it provides multiple benefits. However, this process is solely based on pre-training data statistics, making it hard for the tokenizer to handle infrequent spellings. On the other hand, though robust to misspellings, pure character-level models often lead to unreasonably long sequences and make it harder for the model to learn meaningful words. To alleviate these challenges, we propose a… 
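To make the abstract's point about BPE concrete, here is a minimal sketch of how BPE merges are learned purely from corpus statistics, and why an infrequent spelling fragments into several pieces. The toy word-frequency table and the names `learn_bpe`, `merge_pair`, and `segment` are illustrative assumptions, not code from the paper, and the sketch omits details of production tokenizers such as end-of-word markers.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: count} table (toy version)."""
    # Every word starts as a sequence of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus statistics (assumed for illustration only).
corpus = {"language": 10, "languages": 5, "model": 8, "models": 4}
merges = learn_bpe(corpus, num_merges=50)

print(segment("language", merges))  # a spelling seen in training
print(segment("langauge", merges))  # a misspelling never seen in training
```

Because the merges depend only on the training counts, the frequent spelling collapses into a single unit, while the misspelling has to be covered by several smaller pieces. This is the sensitivity to infrequent spellings that the abstract describes.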

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP
This survey connects several lines of work from the pre-neural and neural eras, showing how hybrid approaches of words and characters, as well as subword-based approaches based on learned segmentation, have been proposed and evaluated.


Revisiting Character-Based Neural Machine Translation with Capacity and Compression
This work shows that the modeling problem can be solved by standard sequence-to-sequence architectures of sufficient depth, and that deep models operating at the character level outperform identical models operating over word fragments. This implies that alternative architectures for handling character input are better viewed as methods for reducing computation time than as improved ways of modeling longer sequences.
Pooled Contextualized Embeddings for Named Entity Recognition
This work proposes a method that dynamically aggregates the contextualized embeddings of each unique string encountered and uses a pooling operation to distill a "global" word representation from all contextualized instances.
Mimicking Word Embeddings using Subword RNNs
MIMICK is presented, an approach to generating OOV word embeddings compositionally by learning a function from spellings to distributional embeddings, with learning performed at the type level of the original word embedding corpus.
Contextual String Embeddings for Sequence Labeling
This paper proposes to leverage the internal states of a trained character language model to produce a novel type of word embedding, referred to as contextual string embeddings, which fundamentally model words as sequences of characters and are contextualized by their surrounding text.
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
A simple regularization method is presented, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training, and a new subword segmentation algorithm based on a unigram language model is proposed.
Neural Machine Translation of Rare Words with Subword Units
This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
Deep Contextualized Word Representations
A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.
Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models
A novel word-character solution to achieving open vocabulary NMT that can successfully learn to not only generate well-formed words for Czech, a highly-inflected language with a very complex vocabulary, but also build correct representations for English source words.
Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation
A model for constructing vector representations of words by composing characters using bidirectional LSTMs that requires only a single vector per character type and a fixed set of parameters for the compositional model, which yields state-of-the-art results in language modeling and part-of-speech tagging.
Multilingual Part-of-Speech Tagging with Bidirectional Long Short-Term Memory Models and Auxiliary Loss
This work presents a novel bi-LSTM model that combines the POS tagging loss function with an auxiliary loss function accounting for rare words; it obtains state-of-the-art performance across 22 languages and works especially well for morphologically complex languages.