Corpus ID: 49667762

Universal Transformers

@article{Dehghani2019UniversalT,
  title={Universal Transformers},
  author={Mostafa Dehghani and Stephan Gouws and Oriol Vinyals and Jakob Uszkoreit and Lukasz Kaiser},
  journal={ArXiv},
  year={2019},
  volume={abs/1807.03819}
}
Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. [...] Key Method: UTs combine the parallelizability and global receptive field of feed-forward sequence models like the Transformer with the recurrent inductive bias of RNNs. We also add a dynamic per-position halting mechanism and find that it improves accuracy on several tasks.
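Below is a minimal NumPy sketch of the mechanism the abstract describes: one weight-shared block is applied repeatedly to all positions in parallel, and a per-position halting unit (in the spirit of Adaptive Computation Time) decides when each position stops being refined. The dense stand-in for the shared block, the halting threshold, and all names and dimensions are illustrative assumptions, not the paper's code.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, max_steps, eps = 6, 16, 8, 0.01

# Shared (recurrent-in-depth) parameters: the same weights are reused at every step.
W_block = rng.normal(scale=0.1, size=(d_model, d_model))
w_halt = rng.normal(scale=0.1, size=(d_model,))

def shared_block(h):
    """Stand-in for the shared self-attention + transition block (assumption)."""
    return np.tanh(h @ W_block)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = rng.normal(size=(seq_len, d_model))   # per-position states
halted = np.zeros(seq_len, dtype=bool)    # which positions have stopped
cum_halt = np.zeros(seq_len)              # accumulated halting probability

for step in range(max_steps):
    p = sigmoid(h @ w_halt)                            # per-position halting probability
    cum_halt = np.where(halted, cum_halt, cum_halt + p)
    newly_halted = (cum_halt >= 1.0 - eps) & ~halted
    # Refine every position that has not yet halted; frozen positions keep their state.
    h = np.where(halted[:, None], h, shared_block(h))
    halted |= newly_halted
    if halted.all():
        break

print("refinement steps run:", step + 1, "; positions halted:", halted.sum())

The actual model additionally mixes intermediate states with ACT-style remainder weights and uses a full self-attention plus transition function; this sketch only shows the shared-weight recurrence in depth and the per-position stopping rule.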
Citations

A Practical Survey on Faster and Lighter Transformers
TLDR: This survey investigates popular approaches to making the Transformer faster and lighter and provides a comprehensive explanation of the methods' strengths, limitations, and underlying assumptions, to meet the desired trade-off between capacity, computation, and memory.
R-Transformer: Recurrent Neural Network Enhanced Transformer
TLDR: The R-Transformer is proposed, which enjoys the advantages of both RNNs and the multi-head attention mechanism while avoiding their respective drawbacks, and can effectively capture both local structures and global long-term dependencies in sequences without any use of position embeddings.
Recurrent Stacking of Layers for Compact Neural Machine Translation Models
TLDR: It is empirically shown that the translation quality of a model that recurrently stacks a single layer 6 times is comparable to the translation quality of a model that stacks 6 separate layers.
Finetuning Pretrained Transformers into RNNs
TLDR: This work proposes a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, the softmax attention is replaced with its linear-complexity recurrent alternative and the model is then finetuned, which provides an improved trade-off between efficiency and accuracy over the standard transformer and other recurrent variants.
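The swap described above relies on attention variants whose causal form can be evaluated as an RNN. The NumPy sketch below shows the generic kernelized ("linear") attention recurrence that such conversions target; the elu(x)+1 feature map and the dimensions are illustrative assumptions, not the paper's exact recipe (which uses a learned feature map).

import numpy as np

def phi(x):
    """Illustrative positive feature map (elu(x) + 1); an assumption, not the paper's choice."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def recurrent_linear_attention(Q, K, V):
    """Causal attention in O(T) time with a constant-size (d_k x d_v) recurrent state."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))      # running sum of phi(k_t) v_t^T
    z = np.zeros(d_k)             # running sum of phi(k_t), for normalization
    out = np.zeros((T, d_v))
    for t in range(T):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

rng = np.random.default_rng(1)
Q = rng.normal(size=(5, 8)); K = rng.normal(size=(5, 8)); V = rng.normal(size=(5, 4))
print(recurrent_linear_attention(Q, K, V).shape)   # (5, 4)

Because the recurrent state has fixed size regardless of sequence length, per-token decoding cost is constant, which is the source of the efficiency gain mentioned in the summary.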
I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths
TLDR: I-BERT is proposed, a bi-directional Transformer that replaces positional encodings with a recurrent layer and inductively generalizes on a variety of algorithmic tasks where state-of-the-art Transformer models fail to do so.
Thinking Like Transformers
TLDR: This paper proposes a computational model for the transformer-encoder in the form of a programming language, the Restricted Access Sequence Processing Language (RASP), and shows how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
Stabilizing Transformers for Reinforcement Learning
TLDR: The proposed architecture, the Gated Transformer-XL (GTrXL), surpasses LSTMs on challenging memory environments and achieves state-of-the-art results on the multi-task DMLab-30 benchmark suite, exceeding the performance of an external memory architecture.
Multi-split Reversible Transformers Can Enhance Neural Machine Translation
TLDR: This work designs three types of multi-split-based reversible transformers and devises a corresponding backpropagation algorithm that does not need to store activations for most layers, and presents two fine-tuning techniques, splits shuffle and self ensemble, to boost translation accuracy.
Random Feature Attention
TLDR: RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, is proposed and explored, showing that RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets.
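As a rough illustration of the random-feature idea behind the entry above, the sketch below uses classic random Fourier features, whose dot product approximates a Gaussian kernel; attention methods in this family build their softmax replacement out of such finite feature maps. The feature count, the input scaling, and the omission of RFA's normalization and gating are simplifying assumptions.

import numpy as np

rng = np.random.default_rng(2)
d, m = 8, 2048                      # key/query dimension, number of random features
W = rng.normal(size=(m, d))         # w_i ~ N(0, I), the random projections
b = rng.uniform(0, 2 * np.pi, m)    # random phases

def rff(x):
    """Classic random Fourier features: rff(x) . rff(y) ~ exp(-||x - y||^2 / 2)."""
    return np.sqrt(2.0 / m) * np.cos(W @ x + b)

# Small-norm vectors so the kernel value is not vanishingly small (demo choice).
q = 0.3 * rng.normal(size=d)
k = 0.3 * rng.normal(size=d)
exact = np.exp(-np.sum((q - k) ** 2) / 2.0)
approx = rff(q) @ rff(k)
print(f"exact kernel {exact:.4f}  vs  random-feature estimate {approx:.4f}")

Once attention scores are expressed through such feature dot products, the same prefix-sum trick as in the recurrent linear-attention sketch above yields attention that is linear in the sequence length.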
Non-autoregressive Machine Translation with Disentangled Context Transformer
State-of-the-art neural machine translation models generate a translation from left to right and every step is conditioned on the previously generated tokens. The sequential nature of this generation [...]

References

Showing 1-10 of 36 references
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
Weighted Transformer Network for Machine Translation
TLDR: The Weighted Transformer is proposed, a Transformer with modified attention layers that not only outperforms the baseline network in BLEU score but also converges 15-40% faster.
Sequence to Sequence Learning with Neural Networks
TLDR: This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions about the sequence structure, and finds that reversing the order of the words in all source sentences markedly improved the LSTM's performance, because doing so introduced many short-term dependencies between the source and the target sentence that made the optimization problem easier.
Learning to Execute
TLDR: This work developed a new variant of curriculum learning that improved the networks' performance in all experimental conditions and had a dramatic impact on an addition problem, enabling an LSTM to add two 9-digit numbers with 99% accuracy.
Convolutional Sequence to Sequence Learning
TLDR: This work introduces an architecture based entirely on convolutional neural networks, which outperforms the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
End-To-End Memory Networks
TLDR: A neural network with a recurrent attention model over a possibly large external memory is introduced; it is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
Tracking the World State with Recurrent Entity Networks
TLDR: The EntNet sets a new state of the art on the bAbI tasks, is the first method to solve all the tasks in the 10k training-example setting, and can generalize past its training horizon.
The Importance of Being Recurrent for Modeling Hierarchical Structure
TLDR: This work compares the two architectures (recurrent versus non-recurrent) with respect to their ability to model hierarchical structure and finds that recurrency is indeed important for this purpose.
Adaptive Computation Time for Recurrent Neural Networks
TLDR: Performance is dramatically improved and insight is provided into the structure of the data, with more computation allocated to harder-to-predict transitions, such as spaces between words and ends of sentences, which suggests that ACT or other adaptive computation methods could provide a generic method for inferring segment boundaries in sequence data.
Memory Architectures in Recurrent Neural Network Language Models
TLDR: The results demonstrate the value of stack-structured memory for explaining the distribution of words in natural language, in line with linguistic theories claiming a context-free backbone for natural language.