Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

  • Taku Kudo
  • Published in ACL 29 April 2018
  • Computer Science
Subword units are an effective way to alleviate the open vocabulary problems in neural machine translation (NMT). […] We present a simple regularization method, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training. In addition, for better subword sampling, we propose a new subword segmentation algorithm based on a unigram language model. We experiment with multiple corpora and report consistent improvements especially on low…
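
The sampling idea in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy vocabulary and its probabilities are made up, the smoothing exponent `alpha` is a hypothetical parameter, and exhaustive enumeration of segmentations is only practical for very short strings. Under a unigram language model, a segmentation is scored as the product of its piece probabilities, and one segmentation is drawn with probability proportional to that score raised to `alpha`.

```python
import math
import random

# Hypothetical toy unigram vocabulary: subword piece -> probability (made up).
TOY_VOCAB = {
    "h": 0.05, "e": 0.05, "l": 0.05, "o": 0.05,
    "he": 0.10, "ll": 0.10, "lo": 0.10,
    "hel": 0.15, "llo": 0.15, "hello": 0.20,
}

def segmentations(text, vocab):
    """Yield every way to split `text` into pieces that are all in `vocab`."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                yield [piece] + rest

def unigram_score(pieces, vocab):
    """Unigram LM probability: the product of the piece probabilities."""
    return math.prod(vocab[p] for p in pieces)

def best_segmentation(text, vocab):
    """Deterministic choice: the single highest-scoring segmentation."""
    return max(segmentations(text, vocab), key=lambda c: unigram_score(c, vocab))

def sample_segmentation(text, vocab, alpha=1.0, rng=random):
    """Sample one segmentation with probability proportional to score**alpha."""
    candidates = list(segmentations(text, vocab))
    weights = [unigram_score(c, vocab) ** alpha for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    print("best:", best_segmentation("hello", TOY_VOCAB))
    for _ in range(3):
        print("sampled:", sample_segmentation("hello", TOY_VOCAB, alpha=0.5, rng=rng))
```

In this sketch a smaller `alpha` flattens the distribution, so the training loop would see more varied segmentations of the same sentence, while a large `alpha` concentrates the samples on the best segmentation.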

Single Model Ensemble for Subword Regularized Models in Low-Resource Machine Translation
An inference strategy that approximates the marginalized likelihood using multiple segmentations, namely the most plausible segmentation plus several sampled ones, and improves the performance of models trained with subword regularization in low-resource machine translation tasks.
Multitask Learning For Different Subword Segmentations In Neural Machine Translation
Block Multitask Learning (BMTL), a novel NMT architecture that predicts multiple targets of different granularities simultaneously, removing the need to search for the optimal segmentation strategy, is presented.
BPE-Dropout: Simple and Effective Subword Regularization
BPE-dropout is introduced: a simple and effective subword regularization method, based on and compatible with conventional BPE, that stochastically corrupts the BPE segmentation procedure and thereby produces multiple segmentations within the same fixed BPE framework.
Multi-view Subword Regularization
To take full advantage of different possible input segmentations, the proposed Multi-view Subword Regularization (MVR) method enforces the consistency of predictors between using inputs tokenized by the standard and probabilistic segmentations.
Subword Regularization: An Analysis of Scalability and Generalization for End-to-End Automatic Speech Recognition
A principled investigation of the regularizing effect of subword segmentation sampling for a streaming end-to-end speech recognition task, suggesting that subword regularization provides a consistent 2-8% relative word-error-rate reduction, even in a large-scale setting with datasets of up to 20k hours.
Bilingual Subword Segmentation for Neural Machine Translation
This paper proposed a new subword segmentation method for neural machine translation, “Bilingual Subword Segmentation,” which tokenizes sentences to minimize the difference between the number of
Morfessor EM+Prune: Improved Subword Segmentation with Expectation Maximization and Pruning
It is shown that this approach to training algorithms for a unigram subword model, based on the Expectation Maximization algorithm and lexicon pruning, is able to find better solutions to the optimization problem defined by the Morfessor Baseline model than its original recursive training algorithm.
Sub-Subword N-Gram Features for Subword-Level Neural Machine Translation
A novel approach that combines subword-level segmentation with character-level information in the form of character n-gram features to construct embedding matrices and softmax output projections for a standard encoder-decoder model that increases the vocabulary size for small training datasets without reducing translation quality.
Auxiliary Subword Segmentations as Related Languages for Low Resource Multilingual Translation
A novel technique that combines alternative subword tokenizations of a single source-target language pair, which makes it possible to leverage multilingual neural translation training methods, improves translation accuracy for low-resource languages, and produces translations that are lexically diverse and morphologically rich.
Improving Neural Machine Translation by Incorporating Hierarchical Subword Features
Incorporating hierarchical subword features in the encoder is shown to consistently improve BLEU scores on the IWSLT evaluation datasets, and the assumption that the appropriate subword units can differ across three modules of the NMT model is confirmed.
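
The stochastic corruption of BPE described in the BPE-dropout entry above can be sketched as follows. This is a minimal illustration under stated assumptions, not that paper's implementation: the merge table `TOY_MERGES` is made up, the function name `bpe_dropout` and the parameter `p` are hypothetical, and the rule used here is that every applicable merge is independently skipped with probability `p` in each round, after which the highest-priority surviving merge is applied.

```python
import random

# Hypothetical merge table, highest priority first (made up for illustration).
TOY_MERGES = [("l", "l"), ("h", "e"), ("ll", "o"), ("he", "llo")]

def bpe_dropout(word, merges, p=0.1, rng=random):
    """Segment `word` with BPE merges, skipping each candidate merge with probability `p`."""
    rank = {pair: i for i, pair in enumerate(merges)}
    pieces = list(word)  # start from single characters
    while True:
        # Adjacent pairs that have a merge rule and survive this round's dropout.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(pieces, pieces[1:]))
            if (a, b) in rank and rng.random() >= p
        ]
        if not candidates:
            return pieces
        # Apply the highest-priority surviving merge (lowest rank, leftmost on ties).
        _, i = min(candidates)
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]

if __name__ == "__main__":
    rng = random.Random(0)
    print(bpe_dropout("hello", TOY_MERGES, p=0.0))  # no dropout: plain BPE
    for _ in range(3):
        print(bpe_dropout("hello", TOY_MERGES, p=0.5, rng=rng))
```

With `p=0.0` the sketch reduces to deterministic BPE, and with `p=1.0` no merge is ever applied, so the word stays split into characters; values in between yield different segmentations of the same word from one fixed merge table.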


Effective Approaches to Attention-based Neural Machine Translation
A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
Unsupervised Neural Machine Translation
This work proposes a novel method to train an NMT system in a completely unsupervised manner, relying on nothing but monolingual corpora; it consists of a slightly modified attentional encoder-decoder model that can be trained on monolingual corpora alone using a combination of denoising and backtranslation.
Synthetic and Natural Noise Both Break Neural Machine Translation
It is found that a model based on a character convolutional neural network is able to simultaneously learn representations robust to multiple kinds of noise, including structure-invariant word representations and robust training on noisy texts.
Deep Unordered Composition Rivals Syntactic Methods for Text Classification
This work presents a simple deep neural network that competes with and, in some cases, outperforms such models on sentiment analysis and factoid question answering tasks while taking only a fraction of the training time.
Neural Lattice-to-Sequence Models for Uncertain Inputs
This work extends the TreeLSTM into a LatticeLSTM that is able to consume word lattices and can be used as an encoder in an attentional encoder-decoder model, and integrates lattice posterior scores into this architecture.
Unsupervised Machine Translation Using Monolingual Corpora Only
This work proposes a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space and effectively learns to translate without using any labeled data.
Lattice-Based Recurrent Neural Network Encoders for Neural Machine Translation
Neural machine translation (NMT) heavily relies on word-level modelling to learn semantic representations of input sentences. However, for languages without natural word delimiters (e.g., Chinese)
A Neural Attention Model for Abstractive Sentence Summarization
This work proposes a fully data-driven approach to abstractive sentence summarization by utilizing a local attention-based model that generates each word of the summary conditioned on the input sentence.
Neural Machine Translation by Jointly Learning to Align and Translate
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
A Neural Conversational Model
A simple approach to conversational modeling that uses the recently proposed sequence-to-sequence framework and is able to extract knowledge from both a domain-specific dataset and a large, noisy, general-domain dataset of movie subtitles.