Multi-armed bandits for online optimization of language model pre-training: the use case of dynamic masking

Iñigo Urteaga, Moulay Draidia, Tomer Lancewicki, Shahram Khadivi
Transformer-based language models (TLMs) provide state-of-the-art performance in many modern natural language processing applications. TLM training is conducted in two phases. First, the model is pre-trained over large volumes of text to minimize a generic objective function, such as the Masked Language Model (MLM). Second, the model is fine-tuned on specific downstream tasks. Pre-training requires large volumes of data and high computational resources, while introducing many still unresolved…



AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing

This comprehensive survey explains core concepts such as pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods; presents a new taxonomy of T-PTLMs; and gives a brief overview of various benchmarks, both intrinsic and extrinsic.

Neural Mask Generator: Learning to Generate Adaptive Word Maskings for Language Model Adaptation

The Neural Mask Generator is validated on several question answering and text classification datasets using BERT and DistilBERT as the language models, where it outperforms rule-based masking strategies by automatically learning optimal adaptive maskings.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

Multi-phase adaptive pretraining is consistently found to offer large gains in task performance, and adapting to a task corpus augmented using simple data selection strategies is shown to be an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable.

Attention is All you Need

A new simple network architecture, the Transformer, is proposed. It is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

ERNIE 2.0: A Continual Pre-training Framework for Language Understanding

A continual pre-training framework named ERNIE 2.0 is proposed, which incrementally builds pre-training tasks and then learns pre-trained models on these constructed tasks via continual multi-task learning.

Effective Unsupervised Domain Adaptation with Adversarially Trained Language Models

This paper shows that careful masking strategies can bridge the domain knowledge gap of masked language models more effectively by allocating self-supervision where it is needed, and it proposes an effective training strategy that adversarially masks out the tokens that are harder for the underlying MLM to reconstruct.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets

A generative model for the validation error as a function of training set size is proposed, which learns during the optimization process and allows exploration of preliminary configurations on small subsets, by extrapolating to the full dataset.

Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

Hyperband, a novel algorithm for hyperparameter optimization, is introduced. The problem is formulated as a pure-exploration non-stochastic infinite-armed bandit problem in which a predefined resource, such as iterations, data samples, or features, is allocated to randomly sampled configurations.