SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

@inproceedings{Kudo2018SentencePieceAS,
  title={SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing},
  author={Taku Kudo and John Richardson},
  booktitle={EMNLP},
  year={2018}
}
This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We…
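Since the abstract centers on training directly from raw, untokenized text, a minimal usage sketch may help. It uses the pip-installable sentencepiece Python bindings; the corpus file name, vocabulary size, and model type below are illustrative assumptions, not settings taken from the paper.

import sentencepiece as spm

# Train a subword model directly from raw (untokenized) sentences.
# 'corpus.txt' is a placeholder: plain text, one raw sentence per line.
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='spm_demo',   # writes spm_demo.model and spm_demo.vocab
    vocab_size=8000,           # illustrative size, not from the paper
    model_type='unigram',      # library default; 'bpe' is also supported
)

sp = spm.SentencePieceProcessor(model_file='spm_demo.model')

pieces = sp.encode('Hello world.', out_type=str)  # e.g. ['▁Hello', '▁world', '.']
ids = sp.encode('Hello world.', out_type=int)

# Detokenization is lossless: whitespace is carried in the pieces themselves
# (as the meta symbol ▁), so decoding restores the original string exactly.
assert sp.decode(ids) == 'Hello world.'

Because whitespace is treated as an ordinary symbol rather than a pre-tokenization boundary, the same pipeline works unchanged for languages without explicit word delimiters, such as Japanese or Chinese.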

Paper Mentions

Stochastic Tokenization with a Language Model for Neural Text Classification
TLDR
This model incorporates a language model for unsupervised tokenization into a text classifier and then trains both models simultaneously, which achieves better performance than previous methods.
A unified approach to sentence segmentation of punctuated text in many languages
The sentence is a fundamental unit of text processing. Yet sentences in the wild are commonly encountered not in isolation, but unsegmented within larger paragraphs and documents. Therefore, the…
POS-Tagging based Neural Machine Translation System for European Languages using Transformers
The interaction between human beings has always faced different kinds of difficulties. One of those difficulties is the language barrier. It would be a tedious task for someone to learn all the…
ByT5: Towards a token-free future with pre-trained byte-to-byte models
TLDR
It is shown that a standard Transformer architecture can be used with minimal modifications to process byte sequences, and it is demonstrated that byte-level models are competitive with their token-level counterparts and perform better on tasks that are sensitive to spelling and pronunciation.
Sequence Generation with Mixed Representations
TLDR
This work introduces a new model architecture that incorporates mixed representations from different tokenizers for sequence generation tasks, along with a co-teaching algorithm to better utilize the diversity of different tokenization methods.
Neural Machine Translation with Byte-Level Subwords
TLDR
This paper investigates byte-level subwords, specifically byte-level BPE (BBPE), which is more compact than a character vocabulary and has no out-of-vocabulary tokens, yet is more efficient than using pure bytes only. It claims that contextualizing BBPE embeddings is necessary, which can be implemented by a convolutional or recurrent layer.
scb-mt-en-th-2020: A Large English-Thai Parallel Corpus
TLDR
The primary objective of this work is to build a large-scale English-Thai dataset with over 1 million segment pairs, curated from various sources; models trained on it are comparable to the Google Translation API for Thai-English and outperform Google when the Open Parallel Corpus is included in the training data.
PunKtuator: A Multilingual Punctuation Restoration System for Spoken and Written Text
TLDR
A multitask modeling approach is described as a system to restore punctuation in multiple high-resource languages (Germanic: English and German; Romance: French) and low-resource languages (Indo-Aryan: Hindi; Dravidian: Tamil) that does not require extensive knowledge of the grammar or syntax of a given language, for both spoken and written text.
Leveraging Neural Machine Translation for Word Alignment
TLDR
This work summarizes different approaches on how word alignment can be extracted from alignment scores and explores ways in which such scores can be extracted from NMT, focusing on inferring word-alignment scores based on output sentence and token probabilities.

References

Showing 1-10 of 17 references
Neural Machine Translation of Rare Words with Subword Units
TLDR
This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
TLDR
A simple regularization method is presented, subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training, and a new subword segmentation algorithm based on a unigram language model is proposed (a minimal sampling sketch follows this reference list).
Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation
TLDR
This work proposes a simple solution to use a single Neural Machine Translation (NMT) model to translate between multiple languages using a shared wordpiece vocabulary, and introduces an artificial token at the beginning of the input sentence to specify the required target language.
Unsupervised Neural Machine Translation
TLDR
This work proposes a novel method to train an NMT system in a completely unsupervised manner, relying on nothing but monolingual corpora, and consists of a slightly modified attentional encoder-decoder model that can be trained on monolingual corpora alone using a combination of denoising and backtranslation.
Neural Machine Translation by Jointly Learning to Align and Translate
TLDR
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Unsupervised Machine Translation Using Monolingual Corpora Only
TLDR
This work proposes a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space and effectively learns to translate without using any labeled data.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
TLDR
GNMT, Google's Neural Machine Translation system, is presented, which attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
A Neural Attention Model for Abstractive Sentence Summarization
TLDR
This work proposes a fully data-driven approach to abstractive sentence summarization by utilizing a local attention-based model that generates each word of the summary conditioned on the input sentence.
A Neural Conversational Model
TLDR
A simple approach to conversational modeling is presented which uses the recently proposed sequence-to-sequence framework and is able to extract knowledge from both a domain-specific dataset and a large, noisy, general-domain dataset of movie subtitles.
Effective Approaches to Attention-based Neural Machine Translation
TLDR
A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
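As noted under the subword regularization entry above, here is a minimal sketch of how the released SentencePiece bindings expose sampling-based segmentation. It reuses the hypothetical spm_demo.model from the earlier sketch; the nbest_size and alpha values are illustrative, not hyperparameters prescribed by the paper.

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='spm_demo.model')  # hypothetical model from the sketch above

# Deterministic (Viterbi) segmentation: the single best split every time.
print(sp.encode('New York', out_type=str))

# Stochastic segmentation for subword regularization: each call samples a
# segmentation from the unigram LM lattice. nbest_size=-1 samples over all
# candidates; alpha sharpens or flattens the sampling distribution.
for _ in range(3):
    print(sp.encode('New York', out_type=str,
                    enable_sampling=True, nbest_size=-1, alpha=0.1))

Training an NMT model on segmentations sampled this way, rather than on a single fixed segmentation, is what the entry above calls subword regularization.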