Corpus ID: 155100151

Joint Source-Target Self Attention with Locality Constraints

@article{Fonollosa2019JointSS,
  title={Joint Source-Target Self Attention with Locality Constraints},
  author={Jos{\'e} A. R. Fonollosa and Noe Casas and Marta Ruiz Costa-juss{\`a}},
  journal={ArXiv},
  year={2019},
  volume={abs/1905.06596}
}
The dominant neural machine translation models are based on the encoder-decoder structure, and many of them rely on an unconstrained receptive field over source and target sequences. This paper studies an alternative architecture: a single self-attention network over the concatenated source and target sequences, with locality constraints on the attention receptive field. As input for training, both source and target sentences are fed to the network, which is trained as a language model. At inference time, the target tokens are predicted autoregressively starting with the source sequence as previous tokens. The proposed model achieves a new state of the art of 35.7 BLEU on IWSLT'14…
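Concretely, the training setup described in the abstract can be pictured as a decoder-style language model over the concatenated source and target tokens, where each position may attend only to a bounded window of previous positions. The sketch below builds such a mask; the single fixed window size is an illustrative assumption, not the exact locality scheme of the paper.

```python
import torch

def local_causal_mask(src_len: int, tgt_len: int, window: int) -> torch.Tensor:
    """Boolean mask for joint source-target self-attention.

    The source and target are treated as one concatenated sequence of
    length src_len + tgt_len. Position i may attend to position j only if
    j <= i (causality) and i - j < window (locality). True entries mark
    blocked positions, ready for masked_fill(-inf) on the attention logits.
    """
    n = src_len + tgt_len
    i = torch.arange(n).unsqueeze(1)  # query positions, shape (n, 1)
    j = torch.arange(n).unsqueeze(0)  # key positions, shape (1, n)
    return (j > i) | ((i - j) >= window)

# Example: 5 source tokens followed by 3 target tokens, window of 4.
mask = local_causal_mask(src_len=5, tgt_len=3, window=4)
print(mask.int())  # 0 = may attend, 1 = blocked
```

At inference time the source tokens are given and the target tokens are appended one at a time, so the same masking rule simply extends with the growing sequence.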

Citations

Self-Attention and Dynamic Convolution Hybrid Model for Neural Machine Translation
TLDR
A hybrid model is proposed that combines a self-attention module and a dynamic convolution module by taking a weighted sum of their outputs where the weights can be dynamically learned by the model during training.
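As a rough illustration of the combination described in the entry above, the sketch below mixes the outputs of two sub-modules with a single learnable weight passed through a sigmoid; the gating used here is an assumption for illustration, not the cited authors' exact weighting scheme.

```python
import torch
import torch.nn as nn

class WeightedHybrid(nn.Module):
    """Weighted sum of two sub-module outputs with a learnable mixing weight.

    `attn` and `conv` are placeholders for a self-attention module and a
    dynamic convolution module; any pair of modules producing outputs of
    the same shape will work.
    """
    def __init__(self, attn: nn.Module, conv: nn.Module):
        super().__init__()
        self.attn = attn
        self.conv = conv
        self.gate = nn.Parameter(torch.zeros(1))  # learned during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.gate)  # mixing weight in (0, 1)
        return w * self.attn(x) + (1.0 - w) * self.conv(x)

# Toy usage with identity placeholders standing in for the two modules.
layer = WeightedHybrid(nn.Identity(), nn.Identity())
out = layer(torch.randn(2, 10, 8))
```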
Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
TLDR
This paper proposes to replace all but one attention head of each encoder layer with simple fixed – non-learnable – attentive patterns that are solely based on position and do not require any external knowledge.
Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
TLDR
This work proposes a novel model called Explicit Sparse Transformer, able to improve the concentration of attention on the global context through an explicit selection of the most relevant segments in the context, and achieves comparable or better results than the previous sparse attention method, but significantly reduces training and testing time.
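The explicit selection described above can be approximated by keeping only the top-k attention scores per query and masking the rest before the softmax; the sketch below implements that simple top-k variant and is not claimed to be the cited paper's exact formulation.

```python
import torch

def topk_sparse_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                          k_top: int) -> torch.Tensor:
    """Attention that keeps only the k_top largest scores per query.

    q, k, v: tensors of shape (batch, length, dim). Scores outside the
    top-k are set to -inf so that the softmax assigns them zero weight,
    concentrating attention on the explicitly selected positions.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (B, Lq, Lk)
    kth = scores.topk(k_top, dim=-1).values[..., -1:]     # k-th largest per query
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 1 sentence, 6 positions, 8-dim vectors, keep the 2 largest scores.
q = k = v = torch.randn(1, 6, 8)
out = topk_sparse_attention(q, k, v, k_top=2)
```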
Multi-Unit Transformers for Neural Machine Translation
TLDR
This paper proposes the Multi-Unit Transformers (MUTE), which aim to promote the expressiveness of the Transformer by introducing diverse and complementary units and shows that modeling with multiple units improves model performance and introduces diversity.
MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning
TLDR
This work proposes the Parallel MUlti-Scale attEntion (MUSE) and MUSE-simple, and finds that although conceptually simple, its success in practice requires intricate considerations and that the multi-scale attention must build on a unified semantic space.
Multi-split Reversible Transformers Can Enhance Neural Machine Translation
TLDR
This work designs three types of multi-split based reversible transformers and devises a corresponding backpropagation algorithm that does not need to store activations for most layers, and presents two fine-tuning techniques, splits shuffle and self ensemble, to boost translation accuracy.
Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding
TLDR
This work proposes a simple yet effective attention guiding mechanism that improves the performance of PLMs by encouraging attention towards the established goals, and introduces two kinds of attention guiding methods, i.e., attention map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG).
…

References

SHOWING 1-10 OF 20 REFERENCES
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, being applied successfully to English constituency parsing with both large and limited training data.
Weighted Transformer Network for Machine Translation
TLDR
Weighted Transformer, a Transformer with modified attention layers, is proposed; it not only outperforms the baseline network in BLEU score but also converges 15-40% faster.
Training Deeper Neural Machine Translation Models with Transparent Attention
TLDR
A simple modification to the attention mechanism is proposed that eases the optimization of deeper models, and results in consistent gains of 0.7-1.1 BLEU on the benchmark WMT’14 English-German and WMT’15 Czech-English tasks.
Effective Approaches to Attention-based Neural Machine Translation
TLDR
A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
Self-Attention with Relative Position Representations
TLDR
This work presents an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements, on the WMT 2014 English-to-German and English-to-French translation tasks.
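For readers unfamiliar with the technique, the sketch below adds a learned embedding of the clipped relative distance between query and key positions to the attention logits. It follows the general recipe of relative position representations in the key term only; the cited paper's full parameterization (which also adds relative terms on the values) differs in detail.

```python
import torch
import torch.nn as nn

class RelativePositionScores(nn.Module):
    """Attention logits with learned relative-position embeddings.

    The relative distance between key and query positions is clipped to
    [-max_dist, max_dist], so longer sequences reuse the outermost
    embeddings, and a position-dependent term is added to the usual
    content-based dot product.
    """
    def __init__(self, dim: int, max_dist: int = 16):
        super().__init__()
        self.max_dist = max_dist
        self.rel_emb = nn.Embedding(2 * max_dist + 1, dim)

    def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q, k: (batch, length, dim); returns logits of shape (batch, Lq, Lk)
        lq, lk = q.size(1), k.size(1)
        rel = (torch.arange(lk, device=q.device).unsqueeze(0)
               - torch.arange(lq, device=q.device).unsqueeze(1))
        idx = rel.clamp(-self.max_dist, self.max_dist) + self.max_dist
        content = q @ k.transpose(-2, -1)                        # content term
        position = torch.einsum("bqd,qkd->bqk", q, self.rel_emb(idx))
        return (content + position) / q.size(-1) ** 0.5
```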
Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
TLDR
This work proposes an alternative approach which instead relies on a single 2D convolutional neural network across both sequences, which outperforms state-of-the-art encoder-decoder systems, while being conceptually simpler and having fewer parameters.
Layer-Wise Coordination between Encoder and Decoder for Neural Machine Translation
TLDR
The concept of layer-wise coordination for NMT is proposed, which explicitly coordinates the learning of hidden representations of the encoder and decoder together layer by layer, gradually from low level to high level.
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Neural Machine Translation by Jointly Learning to Align and Translate
TLDR
It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation
TLDR
This paper identifies several key modeling and training techniques, and applies them to the RNN architecture, yielding a new RNMT+ model that outperforms all of the three fundamental architectures on the benchmark WMT’14 English to French and English to German tasks.
…