Variational Pretraining for Semi-supervised Text Classification

Suchin Gururangan, Tam Dang, Dallas Card, Noah A. Smith. Annual Meeting of the Association for Computational Linguistics.

We introduce VAMPIRE, a lightweight pretraining framework for effective text classification when data and computing resources are limited. We pretrain a unigram document model as a variational autoencoder on in-domain, unlabeled data and use its internal states as features in a downstream classifier. Empirically, we show the relative strength of VAMPIRE against computationally expensive contextual embeddings and other popular semi-supervised baselines under low resource settings. We also find… 

Challenging the Semi-Supervised VAE Framework for Text Classification

This paper questions the adequacy of the standard design of sequence SSVAEs for the task of text classification as they exhibit two sources of overcomplexity, and provides simplifications to preserve their theoretical soundness and provide a better flow of information into the latent variables.

MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification

By mixing labeled, unlabeled, and augmented data, MixText significantly outperformed current pre-trained and fine-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks.

Modern Variational Methods for Semi-Supervised Text Classification

This thesis provides a comprehensive exposition of the modernization of semi-supervised variational methods using semi-supervised text classification as a case study, and introduces VAMPIRE, a descendant of topic models, to adapt neural variational document modeling to modern pretraining methods in order to produce powerful feature extractors that are especially motivated when resources are limited.

Benefits from Variational Regularization in Language Models

This work suggests adding a token-level variational loss to a Transformer architecture and optimizing the standard deviation of the prior distribution in the loss function as a model parameter to increase isotropy. Surprisingly, features extracted at the sentence level also show competitive results on benchmark classification tasks.

Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space

This paper proposes the first large-scale language VAE model, Optimus, a universal latent embedding space for sentences that is first pre-trained on a large text corpus and then fine-tuned for various language generation and understanding tasks.

Automatic Document Selection for Efficient Encoder Pretraining

This work extends Cynical Data Selection, a statistical sentence scoring method that conditions on a representative target domain corpus, and consistently outperforms random selection with 20x less data, 3x fewer training iterations, and 2x less estimated cloud compute cost.

AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing

This comprehensive survey paper explains core concepts such as pretraining, pretraining methods, pretraining tasks, embeddings, and downstream adaptation methods, presents a new taxonomy of T-PTLMs, and gives a brief overview of various benchmarks, both intrinsic and extrinsic.

AdaVAE: Exploring Adaptive GPT-2s in Variational Auto-Encoders for Language Modeling

This paper unifies the encoder and decoder of the VAE model using GPT-2s with adaptive parameter-efficient components, and further introduces a Latent Attention operation to better construct the latent space from transformer models.

Contrast-Enhanced Semi-supervised Text Classification with Few Labels

A certainty-driven sample selection method and a contrast-enhanced similarity graph are proposed to utilize data more efficiently in self-training, alleviating the annotation-starving problem.

Attending to Long-Distance Document Context for Sequence Labeling

This work's model learns to attend to multiple mentions of the same word type in generating a representation for each token in context, extending that work to learning representations that can be incorporated into modern neural models.



Universal Language Model Fine-tuning for Text Classification

This work proposes Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduces techniques that are key for fine-tuning a language model.

Virtual Adversarial Training for Semi-Supervised Text Classification

This work extends adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself.

Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

Supplementary training on data-rich supervised tasks, such as natural language inference, yields additional performance improvements on the GLUE benchmark, as well as reduced variance across random restarts in this setting.

Improved Variational Autoencoders for Text Modeling using Dilated Convolutions

It is shown that with the right decoder, a VAE can outperform LSTM language models, and perplexity gains are demonstrated on two datasets, representing the first positive experimental result on the use of VAEs for generative modeling of text.

Variational Autoencoder for Semi-Supervised Text Classification

A Semi-supervised Sequential Variational Autoencoder (SSVAE) is proposed, which increases the capability by feeding the label into its decoder RNN at each time-step, and reduces the computational complexity in training.

Deep Unordered Composition Rivals Syntactic Methods for Text Classification

This work presents a simple deep neural network that competes with and, in some cases, outperforms such models on sentiment analysis and factoid question answering tasks while taking only a fraction of the training time.

Semi-supervised Learning with Deep Generative Models

It is shown that deep generative models and approximate Bayesian inference exploiting recent advances in variational methods can be used to provide significant improvements, making generative approaches highly competitive for semi-supervised learning.

Dissecting Contextual Word Embeddings: Architecture and Representation

There is a tradeoff between speed and accuracy, but all architectures learn high quality contextual representations that outperform word embeddings for four challenging NLP tasks, suggesting that unsupervised biLMs, independent of architecture, are learning much more about the structure of language than previously appreciated.

Deep Contextualized Word Representations

A new type of deep contextualized word representation is introduced that models both complex characteristics of word use and how these uses vary across linguistic contexts, allowing downstream models to mix different types of semi-supervision signals.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.