UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost
@inproceedings{Wu2021UniDropAS,
  title     = {UniDrop: A Simple yet Effective Technique to Improve Transformer without Extra Cost},
  author    = {Zhen Wu and Lijun Wu and Qi Meng and Yingce Xia and Shufang Xie and Tao Qin and Xinyu Dai and Tie-Yan Liu},
  booktitle = {North American Chapter of the Association for Computational Linguistics},
  year      = {2021}
}
The Transformer architecture has achieved great success across a wide range of natural language processing tasks. The over-parameterization of Transformer models has motivated many works that aim to alleviate overfitting for better performance. Through our explorations, we find that simple techniques such as dropout can greatly boost model performance when carefully designed. Therefore, in this paper, we integrate different dropout techniques into the training of Transformer models. Specifically, we propose an…
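The abstract is truncated above. As an illustration of the kind of dropout integration it describes, here is a minimal PyTorch sketch (not the paper's exact UniDrop formulation) of a Transformer encoder layer that combines feature-level dropout on activations with structure-level dropout that skips the whole layer; `p_feat` and `p_layer` are hypothetical rates.

```python
import torch
import torch.nn as nn

class DropoutTransformerLayer(nn.Module):
    """Toy Transformer encoder layer with two dropout variants:
    feature dropout on hidden activations and structure dropout
    that stochastically skips the whole layer during training."""

    def __init__(self, d_model=512, nhead=8, p_feat=0.1, p_layer=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=p_feat)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Dropout(p_feat),           # feature dropout inside the FFN
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(p_feat)    # feature dropout on sublayer outputs
        self.p_layer = p_layer            # structure (layer) dropout rate

    def forward(self, x):
        # Structure dropout: with probability p_layer, skip this layer (training only).
        if self.training and torch.rand(()) < self.p_layer:
            return x
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + self.drop(attn_out))
        x = self.norm2(x + self.drop(self.ffn(x)))
        return x
```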
8 Citations
Relaxed Attention for Transformer Models
- Computer Science, ArXiv
- 2022
This paper explores relaxed attention, a simple and easy-to-implement smoothing of the attention weights, yielding a two-fold improvement to the general transformer architecture: first, relaxed attention provides regularization when applied to the self-attention layers in the encoder, and second, it naturally supports the integration of an external language model as it suppresses the implicitly learned internal language model by relaxing the cross attention in the decoder.
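As a reading aid, below is a minimal sketch of the attention-weight smoothing described above, assuming the common formulation of blending the softmax weights with a uniform distribution; `gamma` is a hypothetical smoothing coefficient.

```python
import torch

def relaxed_attention_weights(attn, gamma=0.1):
    """Blend normalized attention weights with a uniform distribution.
    attn: (..., query_len, key_len) tensor whose rows sum to 1.
    gamma: smoothing coefficient in [0, 1]."""
    key_len = attn.size(-1)
    uniform = torch.full_like(attn, 1.0 / key_len)
    return (1.0 - gamma) * attn + gamma * uniform
```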
Denoising Self-Attentive Sequential Recommendation
- Computer Science, RecSys
- 2022
In Rec-denoiser, each self-attention layer is attached with a trainable binary mask to prune noisy attentions, resulting in sparse and clean attention distributions that largely purify item-item dependencies and provide better model interpretability.
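A toy sketch of attaching a trainable mask to attention scores, using a sigmoid relaxation rather than the paper's exact differentiable binarization; `max_len` and `temperature` are hypothetical parameters.

```python
import torch
import torch.nn as nn

class MaskedAttentionScores(nn.Module):
    """Illustrative trainable mask over attention logits: small mask values
    (after the sigmoid) push the corresponding attention weights toward zero."""

    def __init__(self, max_len=50, temperature=1.0):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(max_len, max_len))
        self.temperature = temperature

    def forward(self, scores):
        # scores: (batch, L, L) raw attention logits with L <= max_len.
        L = scores.size(-1)
        soft_mask = torch.sigmoid(self.mask_logits[:L, :L] / self.temperature)
        # Adding log(mask) to the logits multiplies the softmax weights by the mask.
        return scores + torch.log(soft_mask + 1e-9)
```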
Bi-SimCut: A Simple Strategy for Boosting Neural Machine Translation
- Computer Science, NAACL
- 2022
Bi-SimCut and SimCut are introduced: simple but effective training strategies for boosting neural machine translation (NMT) performance that consist of bidirectional pretraining and unidirectional finetuning and can serve as strong baselines for future NMT research.
BayesFormer: Transformer with Uncertainty Estimation
- Computer Science, ArXiv
- 2022
This paper introduces BayesFormer, a Transformer model with dropouts designed by Bayesian theory, and proposes a new theoretical framework to extend approximate variational-inference-based dropout to Transformer-based architectures.
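The cited construction is variational; as a rough illustration of dropout-based uncertainty in a Transformer, here is a generic Monte-Carlo-dropout sketch (not BayesFormer's exact procedure).

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=10):
    """Generic MC-dropout uncertainty estimate: keep dropout active at
    inference, run several stochastic forward passes, and report the mean
    prediction and its variance across samples."""
    model.train()                     # keep dropout layers stochastic
    preds = torch.stack([model(x) for _ in range(n_samples)], dim=0)
    return preds.mean(dim=0), preds.var(dim=0)
```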
CipherDAug: Ciphertext based Data Augmentation for Neural Machine Translation
- Computer Science, ACL
- 2022
The method, CipherDAug, uses a co-regularization-inspired training procedure, requires no external data sources other than the original training data, and uses a standard Transformer to outperform strong data augmentation techniques on several datasets by a significant margin.
Long-Range Transformers for Dynamic Spatiotemporal Forecasting
- Computer Science, ArXiv
- 2021
This paper recasts multivariate forecasting as a "spatiotemporal sequence" formulation in which each Transformer input token represents the value of a single variable at a given time; Long-Range Transformers can then learn interactions between space, time, and value information jointly along this extended sequence.
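A minimal sketch of the flattening idea described above (not the paper's embedding scheme): each token carries a time index, a variable index, and the observed value.

```python
import torch

def spatiotemporal_tokens(series):
    """Flatten a multivariate series of shape (T timesteps, N variables) into
    T*N tokens of the form (time_index, variable_index, value), so a
    Transformer can attend jointly across space, time, and value."""
    T, N = series.shape
    t_idx = torch.arange(T).repeat_interleave(N)   # 0,0,...,1,1,...
    v_idx = torch.arange(N).repeat(T)              # 0,1,...,0,1,...
    values = series.reshape(-1)
    return torch.stack([t_idx.float(), v_idx.float(), values], dim=-1)  # (T*N, 3)
```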
BERT, mBERT, or BiBERT? A Study on Contextualized Embeddings for Neural Machine Translation
- Computer Science, EMNLP
- 2021
This paper demonstrates that simply using the output of a tailored and suitable bilingual pre-trained language model (dubbed BiBERT) as the input of the NMT encoder achieves state-of-the-art translation performance and proposes a stochastic layer selection approach and a dual-directional translation model to ensure the sufficient utilization of contextualized embeddings.
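The stochastic layer selection mentioned above can be illustrated roughly as picking a random pretrained-LM layer's output during training; this is a simplification of the paper's approach, and the inference-time averaging shown here is only one reasonable choice.

```python
import random
import torch

def select_contextual_embedding(hidden_states, training=True):
    """hidden_states: sequence of (batch, seq_len, hidden) tensors, one per
    PLM layer. During training, randomly pick one layer's output (stochastic
    layer selection); at inference, average all layers as a simple default."""
    if training:
        return random.choice(list(hidden_states))
    return torch.stack(list(hidden_states), dim=0).mean(dim=0)
```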
Not All Attention Is All You Need
- Computer Science, ArXiv
- 2021
This paper proposes a novel dropout method named AttendOut that makes self-attention-empowered PrLMs capable of more robust task-specific tuning, and demonstrates that state-of-the-art models with elaborate training design can achieve much stronger results.
References
Showing 1-10 of 53 references
Theoretical Analysis of Auto Rate-Tuning by Batch Normalization
- Computer Science, ICLR
- 2019
It is shown that even if the authors fix the learning rate of scale-invariant parameters to a constant, gradient descent still approaches a stationary point at a rate of $T^{-1/2}$ over $T$ iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates.
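For context, the scale-invariance property behind this result can be stated briefly in our own notation; this is a hedged restatement of the claim, not the paper's exact theorem.

```latex
% Scale invariance (e.g., weights feeding into batch normalization):
% L(c\,w) = L(w) for every c > 0, hence the gradient is orthogonal to w.
\[
  \langle \nabla_w L(w),\, w \rangle = 0 ,
  \qquad
  \min_{t \le T} \bigl\|\nabla L(w_t)\bigr\| = O\!\bigl(T^{-1/2}\bigr),
\]
% i.e., with a fixed learning rate on the scale-invariant block, gradient
% descent still reaches an approximate stationary point after T steps.
```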
Attention is All you Need
- Computer Science, NIPS
- 2017
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, being applied successfully to English constituency parsing with both large and limited training data.
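For reference, the core operation of this architecture is scaled dot-product attention; a minimal sketch:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V.
    q, k, v: (..., seq_len, d_k) tensors; mask: optional boolean tensor
    that is True at positions to hide."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```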
Dropout: a simple way to prevent neural networks from overfitting
- Computer Science, J. Mach. Learn. Res.
- 2014
It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
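A minimal sketch of the technique, written as the commonly used inverted-dropout variant (rescaling at training time; the original paper instead scales weights at test time):

```python
import torch

def inverted_dropout(x, p=0.5, training=True):
    """Zero each unit with probability p during training and rescale the
    survivors by 1/(1-p) so no change is needed at test time."""
    if not training or p == 0.0:
        return x
    keep = (torch.rand_like(x) > p).float()
    return x * keep / (1.0 - p)
```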
Alleviating the Inequality of Attention Heads for Neural Machine Translation
- Computer Science, COLING
- 2022
A simple masking method, HeadMask, is proposed in two specific variants; it achieves translation improvements on multiple language pairs and supports the assumption that the attention heads in the Transformer are not equal.
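An illustrative sketch of head masking (not necessarily either of the paper's two schemes): zero out the outputs of randomly chosen heads during training so the remaining heads must carry the signal.

```python
import torch

def mask_attention_heads(head_outputs, n_mask=1, training=True):
    """head_outputs: (batch, n_heads, seq_len, d_head).
    During training, zero out n_mask randomly selected heads."""
    if not training or n_mask == 0:
        return head_outputs
    n_heads = head_outputs.size(1)
    idx = torch.randperm(n_heads)[:n_mask]
    out = head_outputs.clone()
    out[:, idx] = 0.0
    return out
```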
Reducing Transformer Depth on Demand with Structured Dropout
- Computer Science, ICLR
- 2020
LayerDrop, a form of structured dropout, is explored, which has a regularization effect during training and allows for efficient pruning at inference time, and shows that it is possible to select sub-networks of any depth from one large network without having to finetune them and with limited impact on performance.
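A minimal sketch of the idea: skip whole layers at random during training, and keep only a subset of layers at inference to obtain a shallower sub-network. The paper evaluates several pruning strategies; the "keep every k-th layer" selection below is just one of them.

```python
import torch
import torch.nn as nn

class LayerDropEncoder(nn.Module):
    """Stack of layers with LayerDrop-style structured dropout."""

    def __init__(self, layers, p_drop=0.2):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        self.p_drop = p_drop

    def forward(self, x, keep_every=1):
        for i, layer in enumerate(self.layers):
            if self.training:
                if torch.rand(()) < self.p_drop:
                    continue                  # drop the whole layer during training
            elif i % keep_every != 0:
                continue                      # prune to a sub-network at inference
            x = layer(x)
        return x
```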
Regularizing Neural Networks by Penalizing Confident Output Distributions
- Computer Science, ICLR
- 2017
It is found that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
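A minimal sketch of the two regularizers compared above; `beta` and `eps` are hypothetical coefficients, and the label-smoothing target used here (mass eps spread over the non-true classes) is one common formulation.

```python
import torch
import torch.nn.functional as F

def confidence_penalty_loss(logits, targets, beta=0.1):
    """Cross-entropy minus beta * entropy of the model's output distribution,
    penalizing over-confident (low-entropy) predictions."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = F.nll_loss(log_probs, targets)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return nll - beta * entropy

def label_smoothing_loss(logits, targets, eps=0.1):
    """Cross-entropy against targets smoothed toward the other classes."""
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    smooth = torch.full_like(log_probs, eps / (n_classes - 1))
    smooth.scatter_(-1, targets.unsqueeze(-1), 1.0 - eps)
    return -(smooth * log_probs).sum(dim=-1).mean()
```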
Improving Neural Machine Translation Models with Monolingual Data
- Computer Science, ACL
- 2016
This work pairs monolingual training data with automatic back-translations so that they can be treated as additional parallel training data, obtaining substantial improvements on the WMT 15 English→German task and the low-resource IWSLT 14 Turkish→English task.
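The procedure can be sketched roughly as follows; `reverse_model.translate` is a hypothetical API standing in for any target→source translation system.

```python
def back_translate(monolingual_tgt, reverse_model):
    """Create synthetic parallel data: translate target-language monolingual
    sentences back into the source language with a reverse (tgt->src) model,
    then pair each synthetic source with its real target sentence."""
    synthetic_pairs = []
    for tgt_sentence in monolingual_tgt:
        synthetic_src = reverse_model.translate(tgt_sentence)
        synthetic_pairs.append((synthetic_src, tgt_sentence))
    return synthetic_pairs
```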
Sequence Generation with Mixed Representations
- Computer Science, ICML
- 2020
This work introduces a new model architecture to incorporate mixed representations, together with a co-teaching algorithm that better utilizes the diversity of different tokenization methods, leveraging representations from different tokenizers for sequence generation tasks.
Multi-branch Attentive Transformer
- Computer Science, ArXiv
- 2020
A simple yet effective variant of the Transformer, the multi-branch attentive Transformer (briefly, MAT), is proposed, in which the attention layer is the average of multiple branches and each branch is an independent multi-head attention layer.
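A minimal sketch of the averaging idea described above (the paper's full training recipe is not reproduced here):

```python
import torch
import torch.nn as nn

class MultiBranchAttention(nn.Module):
    """Attention layer formed as the average of several independent
    multi-head attention branches."""

    def __init__(self, d_model=512, nhead=8, n_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.MultiheadAttention(d_model, nhead) for _ in range(n_branches)]
        )

    def forward(self, x):
        outs = [branch(x, x, x)[0] for branch in self.branches]
        return torch.stack(outs, dim=0).mean(dim=0)
```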
Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
- Computer Science, ArXiv
- 2019
It is shown that the Transformer can be mathematically interpreted as a numerical Ordinary Differential Equation (ODE) solver for a convection-diffusion equation in a multi-particle dynamic system.
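In our own notation, the residual update of a Transformer block can be read as one Euler step of an ODE, which is the starting point of the interpretation above (the convection-diffusion analysis itself is not reproduced here):

```latex
\[
  x_{l+1} = x_l + F(x_l;\,\theta_l)
  \quad\Longleftrightarrow\quad
  \frac{dx(t)}{dt} = F\bigl(x(t),\, t\bigr)
  \ \ \text{discretized with step size } \Delta t = 1 .
\]
```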