Corpus ID: 238408058

How BPE Affects Memorization in Transformers

Eugene Kharitonov, Marco Baroni, Dieuwke Hupkes
Training data memorization in NLP can be both beneficial (e.g., closed-book QA) and undesirable (e.g., personal data extraction). In any case, successful model training requires a non-trivial amount of memorization to store word spellings, various linguistic idiosyncrasies, and common knowledge. However, little is known about what affects the memorization behavior of NLP models, as the field tends to focus on the equally important question of generalization. In this work, we demonstrate that the size… 

Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models

It is shown that larger models can memorize a larger portion of the data before overfitting, tend to forget less throughout the training process, and memorize training data faster across all settings.

Finding Memo: Extractive Memorization in Constrained Sequence Generation Tasks

It is demonstrated that extractive memorization poses a serious threat to NMT reliability by qualitatively and quantitatively characterizing the memorized samples as well as the model behavior in their vicinity.

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

This survey connects several lines of work from the pre-neural and neural eras, showing how hybrid word-and-character approaches, as well as subword-based approaches built on learned segmentation, have been proposed and evaluated.

Do Language Models Plagiarize?

The findings support that language models, especially GPT-2, reuse particular pieces of text from their training corpus with or without obfuscation, implying that future research on neural language models should take precautions to prevent models from plagiarizing their training datasets.

Recitation-Augmented Language Models

By utilizing recitation as the intermediate step, a recite-and-answer scheme can achieve new state-of-the-art performance in various closed-book question answering (CBQA) tasks.

Can Recurrent Neural Networks Validate Usage-Based Theories of Grammar Acquisition?

This mini-review gives an overview of the state of the field of recurrent artificial neural networks, focusing on how the choice of theoretical framework influences the interpretation of results.

A Mixture-of-Expert Approach to RL-based Dialogue Management

A novel mixture-of-experts language model (MoE-LM) is developed, consisting of an LM capable of learning diverse semantics for conversation histories, a number of specialized LMs capable of generating utterances corresponding to a particular attribute or personality, and an RL-based DM that performs dialogue planning with the utterances generated by the experts.

Pixel-Level BPE for Auto-Regressive Image Generation

This paper proposes to tackle pixel-level autoregression with Transformer models by adapting Byte-Pair Encoding (BPE), originally proposed for text processing, to the image domain, drastically reducing the length of the modeled sequence.

Breaking the Representation Bottleneck of Chinese Characters: Neural Machine Translation with Stroke Sequence Modeling

Experiments show that StrokeNet provides a significant performance boost over strong baselines with fewer model parameters, achieving 26.5 BLEU on the WMT17 Chinese-English task, better than any previously reported result that does not use monolingual data.

Understanding Unintended Memorization in Language Models Under Federated Learning

This work discovers that the clustering of data according to users—which happens by design in FL—has the most significant effect in reducing such memorization, and demonstrates that training in FL with a user-level differential privacy guarantee results in models that can provide high utility while being resilient to memorizing out-of-distribution phrases.

Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

This paper pre-trains MLMs on sentences with randomly shuffled word order and shows that these models still achieve high accuracy after fine-tuning on many downstream tasks, including tasks specifically designed to be challenging for models that ignore word order.

Déjà Vu: an empirical evaluation of the memorization properties of ConvNets

This paper shows how to detect which dataset was used to train a model, in particular whether some validation images were seen at training time, and proposes a new approach to infer membership when a few of the top layers are unavailable or have been fine-tuned.

What they do when in doubt: a study of inductive biases in seq2seq learners

This work investigates how popular seq2seq learners generalize in tasks with high ambiguity in the training data, connects the results to Solomonoff's theory of induction, and proposes description length as a principled and sensitive measure of inductive biases.

Transcoding Compositionally: Using Attention to Find More Generalizable Solutions

This paper presents seq2attn, a new architecture specifically designed to exploit attention to find compositional patterns in the input, which exhibits overgeneralization to a larger degree than a standard sequence-to-sequence model.

Theoretical Limitations of Self-Attention in Neural Sequence Models

Across both soft and hard attention, strong theoretical limitations are shown of the computational abilities of self-attention, finding that it cannot model periodic finite-state languages, nor hierarchical structure, unless the number of layers or heads increases with input length.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.

Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks

This paper introduces the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences, and tests the zero-shot generalization capabilities of a variety of recurrent neural networks trained on SCAN with sequence-to-sequence methods.

Generalization through Memorization: Nearest Neighbor Language Models

It is suggested that learning similarity between sequences of text is easier than predicting the next word, and that nearest neighbor search is an effective approach for language modeling in the long tail.
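The interpolation idea summarized above can be sketched in a few lines: retrieve the nearest stored context vectors, convert their distances into a distribution over the next tokens recorded with them, and mix that with the base LM's distribution. This is a minimal illustrative sketch of the kNN-LM idea, not the paper's implementation; all function and variable names here are assumptions.

```python
import numpy as np

def knn_lm_probs(query, keys, next_tokens, base_probs, vocab_size,
                 k=3, lam=0.5, temp=1.0):
    """Sketch of kNN-LM-style interpolation (illustrative, not the paper's API).

    query:       context vector for the current prediction step
    keys:        stored context vectors from the training data (datastore)
    next_tokens: token id observed after each stored context
    base_probs:  the base LM's next-token distribution
    """
    # Squared L2 distance from the query to every stored key.
    dists = np.sum((keys - query) ** 2, axis=1)
    nn = np.argsort(dists)[:k]  # indices of the k nearest neighbors

    # Softmax over negative distances of the retrieved neighbors.
    logits = -dists[nn] / temp
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Scatter neighbor weights onto their recorded next tokens.
    knn_probs = np.zeros(vocab_size)
    for w, tok in zip(weights, next_tokens[nn]):
        knn_probs[tok] += w

    # Interpolate: p = lam * p_knn + (1 - lam) * p_lm.
    return lam * knn_probs + (1 - lam) * base_probs

# Toy datastore: three stored contexts in a 2-d representation space.
keys = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
next_tokens = np.array([1, 2, 1])
base = np.full(3, 1 / 3)  # uniform base LM over a 3-token vocabulary
p = knn_lm_probs(np.array([0.0, 0.0]), keys, next_tokens, base, 3, k=2)
```

In practice the datastore holds one key per training token and retrieval uses approximate nearest-neighbor search, which is what makes the approach effective in the long tail.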

Neural Machine Translation of Rare Words with Subword Units

This paper introduces a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units, and empirically shows that subword models improve over a back-off dictionary baseline for the WMT 15 translation tasks English-German and English-Russian by 1.3 BLEU.
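The subword-unit encoding described above rests on a simple merge loop: start from characters and repeatedly merge the most frequent adjacent symbol pair. The sketch below illustrates that loop on a toy word-frequency table; it is a minimal reconstruction of the general BPE procedure, not the paper's reference implementation, which additionally handles end-of-word markers and applies learned merges to new text.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a word -> frequency dict (sketch)."""
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count frequencies of adjacent symbol pairs across the vocabulary.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy corpus: the first merges pick up the frequent "es"/"est" suffix.
merges = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 4)
```

Because frequent words end up as single symbols while rare and unknown words decompose into smaller units, the resulting vocabulary supports open-vocabulary translation with a fixed symbol inventory.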