Highway Transformer: Self-Gating Enhanced Self-Attentive Networks

Yekun Chai, Shuo Jin, Xinwen Hou
Self-attention mechanisms have made striking state-of-the-art (SOTA) progress in various sequence learning tasks, building on multi-headed dot-product attention that attends to the global context at every position. Through a pseudo information highway, we introduce a gated component, self-dependency units (SDU), that incorporates LSTM-styled gating units to replenish internal semantic importance within the multi-dimensional latent space of individual representations. The subsidiary…
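The LSTM-styled gating described in the abstract can be sketched as an elementwise sigmoid gate over a candidate transform of the same input, added back onto the sublayer output through a residual path. This is a minimal NumPy sketch; the weight names (`W_g`, `W_c`) and shapes are illustrative, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sdu(x, W_g, b_g, W_c, b_c):
    """LSTM-styled self-dependency unit: an elementwise gate over a
    candidate transform of the same input (parameter names illustrative)."""
    gate = sigmoid(x @ W_g + b_g)   # per-dimension gate in (0, 1)
    cand = np.tanh(x @ W_c + b_c)   # candidate representation in (-1, 1)
    return gate * cand              # gated self-dependency term

# Toy usage: add the gated term back onto the input (pseudo highway).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                       # (tokens, model dim)
W_g, W_c = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
b = np.zeros(8)
out = x + sdu(x, W_g, b, W_c, b)                  # residual + gated term
```

Because the gate is bounded in (0, 1) and the candidate in (-1, 1), the gated term itself is bounded, so it perturbs rather than overwrites the residual stream.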
Combination of Neural Machine Translation Systems at WMT20
This paper presents neural machine translation systems, and their combination, built for the WMT20 English-Polish and Japanese->English translation tasks, and reveals that the presence of translationese text in the validation data led to decisions in building the NMT systems that were not optimal for the best results on the test data.
Transformer with Depth-Wise LSTM
This paper proposes to train Transformers with a depth-wise LSTM, which regards the outputs of layers as steps in a time series instead of using residual connections, under the motivation that the vanishing gradient problem suffered by deep networks is the same problem recurrent networks suffer on long sequences.
Pay Less Attention with Lightweight and Dynamic Convolutions
It is shown that a very lightweight convolution can perform competitively with the best reported self-attention results, and dynamic convolutions are introduced which are simpler and more efficient than self-attention.
R-Transformer: Recurrent Neural Network Enhanced Transformer
The R-Transformer is proposed, which enjoys the advantages of both RNNs and the multi-head attention mechanism while avoiding their respective drawbacks, and can effectively capture both local structures and global long-term dependencies in sequences without any use of position embeddings.
Transformer-XL: Attentive Language Models beyond a Fixed-Length Context
This work proposes a novel neural architecture, Transformer-XL, that enables learning dependency beyond a fixed length without disrupting temporal coherence; it consists of a segment-level recurrence mechanism and a novel positional encoding scheme.
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by its successful application to English constituency parsing with both large and limited training data.
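The core operation underlying the attention mechanisms this listing keeps referring to is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. A minimal single-head NumPy sketch (batch and masking omitted for brevity):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each query forms a convex
    combination of the value rows, weighted by key similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (queries, keys)
    # numerically stable row-wise softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```

When a query matches one key far more strongly than the rest, the softmax row approaches one-hot and the output approaches that key's value row; the 1/sqrt(d_k) scaling keeps the logits from saturating the softmax as the dimension grows.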
Distance-based Self-Attention Network for Natural Language Inference
The attention mechanism has been used as an ancillary means to help RNNs or CNNs. However, the Transformer (Vaswani et al., 2017) recently recorded state-of-the-art performance in machine translation…
Language Modeling with Gated Convolutional Networks
A finite-context approach through stacked convolutions is developed, which can be more efficient since it allows parallelization over sequential tokens; this is the first time a non-recurrent approach is competitive with strong recurrent models on these large-scale language tasks.
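The gating used in that convolutional language model is the gated linear unit (GLU), which multiplies a linear transform of the input by a sigmoid gate computed from the same input. A minimal NumPy sketch, with parameter names (`W`, `b`, `V`, `c`) following the usual (xW + b) * sigmoid(xV + c) formulation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W, b, V, c):
    """Gated linear unit: (x W + b) * sigmoid(x V + c).
    The sigmoid path decides, per dimension, how much of the
    linear path to let through."""
    return (x @ W + b) * sigmoid(x @ V + c)
```

When the gate saturates toward 1 the unit reduces to the plain linear path, which is part of why GLUs keep a well-conditioned gradient path through deep convolutional stacks.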
Highway Networks
A new architecture designed to ease gradient-based training of very deep networks is introduced, characterized by the use of gating units which learn to regulate the flow of information through the network.
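The highway gating that this listing's main paper borrows can be sketched in a few lines: a transform gate T interpolates between a nonlinear transform H(x) and the input itself. A minimal NumPy sketch; the tanh transform and the weight names are illustrative choices, not the only ones the architecture allows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, with carry gate C = 1 - T."""
    H = np.tanh(x @ W_h + b_h)     # nonlinear transform
    T = sigmoid(x @ W_t + b_t)     # transform gate in (0, 1)
    return T * H + (1.0 - T) * x   # gate interpolates transform vs. carry
```

Initializing the transform-gate bias `b_t` to a large negative value drives T toward 0, so the layer starts out near the identity map; this is what lets gradients flow through very deep stacks early in training.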
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.
A Gated Self-attention Memory Network for Answer Selection
This work departs from the popular Compare-Aggregate architecture and instead proposes a new gated self-attention memory network for the answer selection task, which outperforms previous methods by a large margin.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.