Corpus ID: 53109111

You May Not Need Attention

Ofir Press and Noah A. Smith
In NMT, how far can we get without attention and without separate encoding and decoding? To answer that question, we introduce a recurrent neural translation model that does not use attention and does not have a separate encoder and decoder. Our eager translation model is low-latency, writing target tokens as soon as it reads the first source token, and uses constant memory during decoding. It performs on par with the standard attention-based model of Bahdanau et al. (2014), and better on long… 


Is Encoder-Decoder Redundant for Neural Machine Translation?

This work investigates simply concatenating the source and target sentences and training a language model to translate, suggesting that an encoder-decoder architecture might be redundant for neural machine translation.

Infusing Future Information into Monotonic Attention Through Language Models

The proposed SNMT method improves the quality-latency trade-off over state-of-the-art monotonic multihead attention; experiments on the MuST-C English-German and English-French speech-to-text translation tasks show the effectiveness of this framework.

STACL: Simultaneous Translation with Integrated Anticipation and Controllable Latency

This work introduces a very simple yet surprisingly effective “wait-k” model, trained to generate the target sentence concurrently with the source sentence but always k words behind, for any given k.
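
The "always k words behind" schedule can be illustrated with a minimal sketch. The function name `wait_k_schedule` and the `READ`/`WRITE` action labels are illustrative assumptions and are not taken from the paper's code. The target length is passed in explicitly here, whereas a real model would stop when it emits an end-of-sentence token.

```python
def wait_k_schedule(src_len, tgt_len, k):
    """Return the READ/WRITE action sequence of a wait-k style policy.

    Before writing target token t (1-indexed), the policy must have read
    min(k + t - 1, src_len) source tokens.
    """
    actions = []
    read = 0  # number of source tokens consumed so far
    for t in range(1, tgt_len + 1):
        needed = min(k + t - 1, src_len)
        while read < needed:
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    return actions


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    print(wait_k_schedule(src_len=4, tgt_len=4, k=2))
```

With k = 2, the sketch reads two source tokens, then alternates one write with one read until the source is exhausted, and finally writes the remaining target tokens without further reads.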

Neural Simultaneous Speech Translation Using Alignment-Based Chunking

This work presents a neural machine translation (NMT) model that dynamically decides when to continue reading input and when to generate output words, and compares models with bidirectional and unidirectional encoders of different depths on both real speech and text input.

Simultaneous Translation with Flexible Policy via Restricted Imitation Learning

This work proposes a much simpler single model that adds a “delay” token to the target vocabulary, and designs a restricted dynamic oracle to greatly simplify training.

Speed Up the Training of Neural Machine Translation

This work proposes a novel NMT model based on the conventional bidirectional recurrent neural network (bi-RNN) with a tanh activation function, which learns future and history context information more fully in order to speed up the training process.

Efficient Wait-k Models for Simultaneous Machine Translation

This work investigates the behavior of wait-k decoding in low-resource settings for spoken corpora using IWSLT datasets, and improves the training of these models by using unidirectional encoders and training across multiple values of k.

Monotonic Infinite Lookback Attention for Simultaneous Machine Translation

This work presents the first simultaneous translation system to learn an adaptive schedule jointly with a neural machine translation (NMT) model that attends over all source tokens read thus far, and shows that MILk’s adaptive schedule allows it to arrive at latency-quality trade-offs that compare favorably to those of a recently proposed wait-k strategy for many latency values.

Comprehension of Subtitles from Re-Translating Simultaneous Speech Translation

The results show that subtitling layout and flicker have little effect on comprehension, in contrast to machine translation itself and individual competence, and that users with limited knowledge of the source language have different preferences for stability and latency than users with no knowledge.

Transformer-Based Direct Hidden Markov Model for Machine Translation

This work introduces the concept of the hidden Markov model into the transformer architecture; the resulting model outperforms the transformer baseline, and the zero-order model already provides promising performance, giving it an edge compared to a model with first-order dependency.



How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures

This work takes a fine-grained look at the different architectures for NMT and introduces an Architecture Definition Language (ADL) allowing for a flexible combination of common building blocks and shows that self-attention is much more important on the encoder side than on the decoder side.

Neural Hidden Markov Model for Machine Translation

It is shown that the attention component can be effectively replaced by the neural network alignment model, and that the neural HMM approach provides performance comparable to state-of-the-art attention-based models on the WMT 2017 German↔English and Chinese→English translation tasks.

Effective Approaches to Attention-based Neural Machine Translation

A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.

Attention is All you Need

This work proposes a new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.

Online Segment to Segment Neural Transduction

This work presents an online neural sequence-to-sequence model that learns to alternate between encoding and decoding segments of the input as it is read, tackling the bottleneck of vanilla encoder-decoders, which have to read and memorize the entire input sequence in their fixed-length hidden states.

Neural Phrase-based Machine Translation

In this paper, we propose Neural Phrase-based Machine Translation (NPMT). Our method explicitly models the phrase structures in output sequences through Sleep-WAke Networks (SWAN), a recently

Online and Linear-Time Attention by Enforcing Monotonic Alignments

This work proposes an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time and validates the approach on sentence summarization, machine translation, and online speech recognition problems.

STACL: Simultaneous Translation with Integrated Anticipation and Controllable Latency

This work introduces a very simple yet surprisingly effective “wait-k” model, trained to generate the target sentence concurrently with the source sentence but always k words behind, for any given k.

Regularizing and Optimizing LSTM Language Models

This paper proposes the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization and introduces NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user.

Learning to Translate in Real-time with Neural Machine Translation

This work proposes a neural machine translation (NMT) framework for simultaneous translation in which an agent learns to decide when to translate by interacting with a pre-trained NMT environment.