UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost

Zhen Wu, Lijun Wu, Qi Meng, Yingce Xia, Shufang Xie, Tao Qin, Xinyu Dai, Tie-Yan Liu. UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost. In Proceedings of the North American Chapter of the Association for Computational Linguistics.
The Transformer architecture achieves great success in abundant natural language processing tasks. The over-parameterization of the Transformer model has motivated plenty of works to alleviate its overfitting for superior performance. Through exploration, we find that simple techniques such as dropout can greatly boost model performance with careful design. Therefore, in this paper, we integrate different dropout techniques into the training of Transformer models. Specifically, we propose an…

Relaxed Attention for Transformer Models

This paper explores relaxed attention, a simple and easy-to-implement smoothing of the attention weights, yielding a two-fold improvement to the general transformer architecture: first, relaxed attention provides regularization when applied to the self-attention layers in the encoder, and second, it naturally supports the integration of an external language model as it suppresses the implicitly learned internal language model by relaxing the cross attention in the decoder.
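The smoothing described above fits in a few lines; the sketch below is an illustrative reading rather than the paper's code, and the coefficient name `gamma` and its default value are assumptions:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def relaxed_attention(scores, gamma=0.1):
    """Blend the softmax attention weights with a uniform distribution.

    gamma is a hypothetical relaxation coefficient; gamma = 0 recovers
    standard attention, larger values pull the weights toward uniform.
    """
    weights = softmax(scores)
    n = len(weights)
    return [(1 - gamma) * w + gamma / n for w in weights]
```

The blended weights still sum to one, so the result remains a valid attention distribution, only flatter.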

Denoising Self-Attentive Sequential Recommendation

In Rec-denoiser, each self-attention layer is attached to a trainable binary mask that prunes noisy attention weights, resulting in sparse and clean attention distributions that largely purify item-item dependencies and provide better model interpretability.

Bi-SimCut: A Simple Strategy for Boosting Neural Machine Translation

Bi-SimCut and SimCut are introduced: simple but effective training strategies to boost neural machine translation (NMT) performance, consisting of bidirectional pretraining and unidirectional finetuning, that can serve as strong baselines for future NMT research.

BayesFormer: Transformer with Uncertainty Estimation

This paper introduces BayesFormer, a Transformer model with dropouts designed by Bayesian theory, and proposes a new theoretical framework to extend approximate variational inference-based dropout to Transformer-based architectures.

CipherDAug: Ciphertext based Data Augmentation for Neural Machine Translation

The method, CipherDAug, uses a co-regularization-inspired training procedure, requires no external data sources other than the original training data, and uses a standard Transformer to outperform strong data augmentation techniques on several datasets by a significant margin.

Long-Range Transformers for Dynamic Spatiotemporal Forecasting

This paper recasts multivariate forecasting as a “spatiotemporal sequence” formulation, where each Transformer input token represents the value of a single variable at a given time; Long-Range Transformers can then learn interactions between space, time, and value information jointly along this extended sequence.

BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation

This paper demonstrates that simply using the output of a tailored and suitable bilingual pre-trained language model (dubbed BiBERT) as the input of the NMT encoder achieves state-of-the-art translation performance and proposes a stochastic layer selection approach and a dual-directional translation model to ensure the sufficient utilization of contextualized embeddings.

Not All Attention Is All You Need

This paper proposes a novel dropout method named AttendOut to make self-attention-empowered PrLMs capable of more robust task-specific tuning, and demonstrates that state-of-the-art models with elaborate training design may achieve much stronger results.

Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

It is shown that even if the authors fix the learning rate of scale-invariant parameters to a constant, gradient descent still approaches a stationary point at a rate of $T^{-1/2}$ in $T$ iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates.

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
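The core operation of the architecture, scaled dot-product attention, can be sketched with plain Python lists; this restates the published formula Attention(Q, K, V) = softmax(QKᵀ/√d_k)V for illustration and is not the reference implementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def scaled_dot_product_attention(Q, K, V):
    """For each query row, score it against every key, scale by
    1/sqrt(d_k), softmax the scores, and take the weighted sum of
    the value rows."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out
```

Each output row is a convex combination of the value rows, weighted by query-key similarity.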

Dropout: a simple way to prevent neural networks from overfitting

It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.

Alleviating the Inequality of Attention Heads for Neural Machine Translation

A simple masking method, HeadMask, is proposed in two specific variants; it achieves translation improvements on multiple language pairs and supports the assumption that the attention heads in the Transformer are not equally important.
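One plausible instantiation of head masking is to zero out a few heads' outputs during training, selected uniformly at random; this is an assumption for illustration and does not reproduce the paper's two specific selection schemes:

```python
import random

def headmask(head_outputs, k=1, rng=random):
    """Zero out k of the attention heads' output vectors, forcing the
    remaining heads to carry the representation for this step.

    head_outputs: list of per-head output vectors (lists of floats).
    """
    n = len(head_outputs)
    dropped = set(rng.sample(range(n), k))
    return [[0.0] * len(h) if i in dropped else list(h)
            for i, h in enumerate(head_outputs)]
```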

Reducing Transformer Depth on Demand with Structured Dropout

LayerDrop, a form of structured dropout, is explored, which has a regularization effect during training and allows for efficient pruning at inference time, and shows that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance.
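A minimal sketch of the training-time behavior, assuming the simplest policy of dropping each layer independently with a fixed probability `p`:

```python
import random

def layerdrop_forward(x, layers, p=0.2, training=True, rng=random):
    """Skip each layer with probability p during training; the
    residual path simply carries x through a skipped layer. At
    inference, keep all layers (or prune to a fixed subset)."""
    for layer in layers:
        if training and rng.random() < p:
            continue  # this layer is dropped for this forward pass
        x = layer(x)
    return x
```

Because every layer is sometimes absent during training, shallower sub-networks obtained by pruning at inference time have already been exercised.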

Regularizing Neural Networks by Penalizing Confident Output Distributions

It is found that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
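The confidence penalty can be sketched as an entropy term subtracted from the negative log-likelihood, so that low-entropy (over-confident) output distributions are penalized; `beta` is a hypothetical coefficient, not a value from the paper:

```python
import math

def confidence_penalty_loss(probs, target, beta=0.1):
    """NLL minus beta times the entropy of the predicted distribution.

    probs: predicted class probabilities (must sum to 1).
    target: index of the true class.
    """
    nll = -math.log(probs[target])
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return nll - beta * entropy
```

With beta = 0 this reduces to the ordinary cross-entropy loss; larger beta rewards flatter, less confident predictions.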

Improving Neural Machine Translation Models with Monolingual Data

This work pairs monolingual training data with automatic back-translations, allowing it to be treated as additional parallel training data, and obtains substantial improvements on the WMT 15 English→German task and the low-resource IWSLT 14 Turkish→English task.

Sequence Generation with Mixed Representations

This work leverages mixed representations from different tokenizers for sequence generation tasks, introducing a new model architecture to incorporate the mixed representations and a co-teaching algorithm to better utilize the diversity of different tokenization methods.

Multi-branch Attentive Transformer

A simple yet effective variant of the Transformer called the multi-branch attentive Transformer (briefly, MAT) is proposed, in which the attention layer is the average of multiple branches, each branch being an independent multi-head attention layer.

Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

It is shown that the Transformer can be mathematically interpreted as a numerical Ordinary Differential Equation (ODE) solver for a convection-diffusion equation in a multi-particle dynamic system.