Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling

@article{Shen2018ReinforcedSN,
  title={Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling},
  author={Tao Shen and Tianyi Zhou and Guodong Long and Jing Jiang and Sen Wang and Chengqi Zhang},
  journal={ArXiv},
  year={2018},
  volume={abs/1801.10296}
}
Many natural language processing tasks rely solely on sparse dependencies between a few tokens in a sentence. [...] In ReSA, a hard attention trims a sequence for a soft self-attention to process, while the soft attention feeds reward signals back to facilitate training of the hard one. For this purpose, we develop a novel hard attention called "reinforced sequence sampling (RSS)", which selects tokens in parallel and is trained via policy gradient.
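
The sketch below illustrates, in numpy, the hard/soft interaction the abstract describes; it is not the authors' implementation. A per-token Bernoulli policy (rss_sample, with a hypothetical parameter vector W_p) stands in for RSS, the kept tokens feed a plain dot-product self-attention, and a stand-in scalar reward drives a REINFORCE-style update of the policy. The dimensions, policy form, and reward definition are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rss_sample(H, W_p):
    # One independent keep/drop decision per token; all tokens are
    # sampled in parallel, with no recurrence between decisions.
    probs = 1.0 / (1.0 + np.exp(-(H @ W_p)))          # keep probabilities, shape (n,)
    keep = rng.random(probs.shape) < probs            # hard, non-differentiable selection
    return keep, probs

def soft_self_attention(H, keep):
    # Plain dot-product self-attention restricted to the kept tokens.
    scores = H @ H.T / np.sqrt(H.shape[-1])           # (n, n)
    scores = np.where(keep[None, :], scores, -1e9)    # pruned tokens cannot be attended to
    return softmax(scores, axis=-1) @ H

n, d = 6, 8                                           # toy sizes (assumptions)
H = rng.normal(size=(n, d))                           # token representations
W_p = rng.normal(size=(d,))                           # RSS policy parameters (hypothetical)

keep, probs = rss_sample(H, W_p)
context = soft_self_attention(H, keep)

# Policy-gradient update: in the paper the reward comes from the soft
# attention / downstream task; here it is a stand-in scalar.
reward = -np.abs(context).mean()
grad_log_pi = ((keep - probs)[:, None] * H).sum(axis=0)   # d log pi / d W_p
W_p += 0.01 * reward * grad_log_pi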

Look Harder: A Neural Machine Translation Model with Hard Attention

A hard-attention-based NMT model is proposed which selects a subset of source tokens for each target token to handle long-sequence translation effectively, achieving significant BLEU score improvements on the English-German (EN-DE) and English-French (EN-FR) translation tasks.

Improving Self-Attention Networks With Sequential Relations

Experiments in natural language inference, machine translation and sentiment analysis tasks show that the sequential relation modeling helps self-attention networks outperform existing approaches.

Learning Hard Retrieval Cross Attention for Transformer

The hard retrieval attention mechanism can empirically accelerate scaled dot-product attention for both long and short sequences by 66.5%, while performing competitively on a wide range of machine translation tasks when used for cross-attention networks.

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection

This work proposes a novel model called Explicit Sparse Transformer, able to improve the concentration of attention on the global context through an explicit selection of the most relevant segments in the context, and achieves comparable or better results than the previous sparse attention method, but significantly reduces training and testing time.
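
As a rough illustration of the "explicit selection" idea, the following numpy sketch keeps only the k highest-scoring keys per query and masks out the rest before the softmax; the sizes, the value of k, and the name topk_sparse_attention are assumptions, not the paper's implementation.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(Q, K, V, k=2):
    # Keep only the k largest scores per query and mask out the rest, so
    # each query attends to an explicitly selected subset of the context.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n_q, n_k)
    kth = np.sort(scores, axis=-1)[:, -k][:, None]     # k-th largest score per row
    masked = np.where(scores >= kth, scores, -np.inf)  # explicit selection
    return softmax(masked, axis=-1) @ V

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(topk_sparse_attention(Q, K, V, k=2).shape)       # (4, 8)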

CGSPN : cascading gated self-attention and phrase-attention network for sentence modeling

A Cascading Gated Self-attention and Phrase-attention Network (CGSPN) is proposed that generates the sentence embedding by considering contextual words and key phrases in a sentence, abstracting the semantics of phrases.

Assessing the Ability of Self-Attention Networks to Learn Word Order

Experimental results reveal that: 1) SAN trained on word reordering detection indeed has difficulty learning positional information, even with the position embedding; and 2) SAN trained on machine translation learns better positional information than its RNN counterpart, in which the position embedding plays a critical role.

Sparse Transformer: Concentrated Attention Through Explicit Selection

A novel model called Sparse Transformer is proposed that is able to improve the concentration of attention on the global context through an explicit selection of the most relevant segments, and it reaches state-of-the-art performance on the IWSLT 2015 English-to-Vietnamese and IWSLT 2014 German-to-English translation tasks.

Learning Hard Retrieval Decoder Attention for Transformers

An approach to learning a hard retrieval attention in which an attention head attends to only one token in the sentence rather than all tokens, which is 1.43 times faster in decoding and preserves translation quality on a wide range of machine translation tasks when used in the decoder self- and cross-attention networks.
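
At inference time the idea reduces to an index lookup, which is why decoding gets faster. Here is a minimal numpy sketch of that retrieval step only (how the hard choice is trained is not shown); the shapes and the name hard_retrieval_attention are illustrative assumptions.

import numpy as np

def hard_retrieval_attention(Q, K, V):
    # Each query retrieves exactly one value: take the argmax-scoring key
    # and gather its value instead of computing a softmax-weighted sum.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_q, n_k)
    idx = scores.argmax(axis=-1)              # hard selection: one token per query
    return V[idx]                             # gathering replaces the weighted sum

rng = np.random.default_rng(2)
Q, K, V = rng.normal(size=(5, 8)), rng.normal(size=(7, 8)), rng.normal(size=(7, 8))
out = hard_retrieval_attention(Q, K, V)       # (5, 8): one retrieved value per query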

Multiple Positional Self-Attention Network for Text Classification

The result shows that the MPSAN outperforms state-of-the-art methods on five datasets, and test accuracy is improved by 0.81% and 0.6% on the SST and CR datasets, respectively.
...

References

SHOWING 1-10 OF 58 REFERENCES

Attention is All you Need

A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
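
The building block the summary refers to is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. A minimal single-head numpy version is given below; the multi-head projections, masking, and the rest of the architecture are omitted.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V -- the core operation multi-head
    # attention in the Transformer is built from.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(3)
Q, K, V = rng.normal(size=(4, 16)), rng.normal(size=(6, 16)), rng.normal(size=(6, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 16)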

Recurrent Neural Network-Based Sentence Encoder with Gated Attention for Natural Language Inference

This paper describes a model (alpha) that is ranked among the top in the Shared Task, on both the in-domain test set and the cross-domain test set, demonstrating that the model generalizes well to cross-domain data.

DiSAN: Directional Self-Attention Network for RNN/CNN-free Language Understanding

A novel attention mechanism is proposed in which the attention between elements of the input sequence(s) is directional and multi-dimensional (i.e., feature-wise), along with a light-weight neural net based solely on the proposed attention, without any RNN/CNN structure, which outperforms complicated RNN models in both prediction quality and time efficiency.
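
A toy numpy sketch of the two ingredients named above is given below: a directional mask (attend only to earlier or later positions) and multi-dimensional, feature-wise attention weights. The additive scoring function, the toy parameters, and the inclusion of the token itself in the mask are simplifications and do not follow DiSAN's exact formulation.

import numpy as np

def directional_multidim_attention(H, direction="forward"):
    # Attention weights are vectors (one weight per feature), and a
    # directional mask limits each position to earlier ("forward") or
    # later ("backward") positions; the token itself is kept for simplicity.
    n, d = H.shape
    W1, W2 = np.eye(d) * 0.5, np.eye(d) * 0.5           # toy parameters (assumptions)
    # scores[i, j, :] is a d-dimensional alignment between tokens i and j
    scores = np.tanh(H @ W1)[:, None, :] + np.tanh(H @ W2)[None, :, :]
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    mask = (j <= i) if direction == "forward" else (j >= i)
    scores = np.where(mask[:, :, None], scores, -1e9)
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return (w * H[None, :, :]).sum(axis=1)              # (n, d) feature-wise mixture

rng = np.random.default_rng(4)
print(directional_multidim_attention(rng.normal(size=(5, 8))).shape)   # (5, 8)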

Effective Approaches to Attention-based Neural Machine Translation

A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
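
The contrast can be made concrete with a small numpy sketch: global attention scores every encoder state, while local attention scores only a window of width 2D+1 around an aligned position and reweights it with a Gaussian. For simplicity the aligned position is passed in directly rather than predicted from the decoder state, and the dot-product scoring and sizes are assumptions.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(h_t, source):
    # Attend to every source state.
    return softmax(source @ h_t) @ source

def local_attention(h_t, source, center, D=2):
    # Attend only to a 2D+1 window of source states around `center`,
    # with a Gaussian (sigma = D/2) favouring positions near the centre.
    lo, hi = max(0, center - D), min(len(source), center + D + 1)
    window = source[lo:hi]
    weights = softmax(window @ h_t)
    weights = weights * np.exp(-((np.arange(lo, hi) - center) ** 2) / (2 * (D / 2) ** 2))
    weights /= weights.sum()
    return weights @ window

rng = np.random.default_rng(5)
src = rng.normal(size=(9, 16))      # encoder states
h_t = rng.normal(size=(16,))        # current decoder state
c_global = global_attention(h_t, src)
c_local = local_attention(h_t, src, center=4)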

Coarse-to-Fine Question Answering for Long Documents

A framework for question answering that can efficiently scale to longer documents while maintaining or even improving performance of state-of-the-art models is presented and sentence selection is treated as a latent variable trained jointly from the answer only using reinforcement learning.

Learning Natural Language Inference using Bidirectional LSTM model and Inner-Attention

A sentence encoding-based model for recognizing text entailment that utilizes the sentence's first-stage representation to attend over words appearing in the sentence itself, a mechanism called "Inner-Attention" in this paper.

Language Modeling with Gated Convolutional Networks

A finite-context approach through stacked convolutions, which can be more efficient since they allow parallelization over sequential tokens, is developed; it is the first time a non-recurrent approach has been competitive with strong recurrent models on these large-scale language tasks.
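
The gating used in this line of work is the gated linear unit, h = (X*W) * sigmoid(X*V): one causal convolution provides the features and a second one gates them element-wise, and every position can be computed in parallel. The numpy sketch below uses a single layer with toy weights; the names and sizes are assumptions.

import numpy as np

def glu_causal_conv(X, W_a, W_b, width=3):
    # One gated convolution layer: a causal 1-D convolution whose output
    # is gated element-wise by a sigmoid of a second convolution,
    # h = (X*W_a) * sigmoid(X*W_b); all positions are computed in parallel.
    n, d = X.shape
    pad = np.vstack([np.zeros((width - 1, d)), X])                      # left-pad: no look-ahead
    windows = np.stack([pad[i:i + width].ravel() for i in range(n)])    # (n, width*d)
    a = windows @ W_a                                                   # linear path
    b = windows @ W_b                                                   # gate path
    return a * (1.0 / (1.0 + np.exp(-b)))                               # gated linear unit

rng = np.random.default_rng(6)
n, d, width = 7, 8, 3
X = rng.normal(size=(n, d))
W_a = rng.normal(size=(width * d, d)) * 0.1
W_b = rng.normal(size=(width * d, d)) * 0.1
print(glu_causal_conv(X, W_a, W_b, width).shape)                        # (7, 8)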

Learning to Skim Text

The proposed model is a modified LSTM with jumping, a recurrent network that learns how far to jump after reading a few words of the input text; it is up to 6 times faster than the standard sequential LSTM while maintaining the same or even better accuracy.
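
The control flow is the interesting part: read a small chunk, let a learned policy decide how many of the following tokens to skip, and repeat. The sketch below shows only that loop; the random jump stands in for the policy and there is no actual LSTM, so everything except the read/skip pattern is an assumption.

import numpy as np

def skim(tokens, read_size=2, max_jump=3, seed=7):
    # Skimming control flow: read a few tokens, then skip ahead by an
    # amount chosen by a policy (a random stand-in here), and repeat.
    rng = np.random.default_rng(seed)
    i, read = 0, []
    while i < len(tokens):
        read.extend(tokens[i:i + read_size])        # tokens actually processed
        i += read_size
        i += int(rng.integers(1, max_jump + 1))     # policy output: how far to jump
    return read

text = "the model learns how far to jump after reading a few words".split()
print(skim(text))   # only a subset of the tokens is ever read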

Bidirectional Attention Flow for Machine Comprehension

The BIDAF network is introduced, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.
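
The two attention directions can be sketched from a single similarity matrix: context-to-query attention picks out the relevant query words for each context word, and query-to-context attention picks out the context words most relevant to any query word. The numpy version below uses a plain dot-product similarity instead of BIDAF's trilinear scoring, so the scoring function and sizes are simplifying assumptions; the output layout [C; C2Q; C*C2Q; C*Q2C] follows the usual formulation.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_attention_flow(C, Q):
    # Attention in both directions from a shared similarity matrix.
    S = C @ Q.T                                  # (T, J) similarity (simplified scoring)
    c2q = softmax(S, axis=1) @ Q                 # (T, d) attended query per context word
    b = softmax(S.max(axis=1), axis=0)           # (T,)  weight per context word
    q2c = np.tile(b @ C, (C.shape[0], 1))        # (T, d) attended context, tiled
    return np.concatenate([C, c2q, C * c2q, C * q2c], axis=1)   # query-aware representation

rng = np.random.default_rng(8)
C = rng.normal(size=(6, 8))    # context word representations
Q = rng.normal(size=(4, 8))    # query word representations
print(bidirectional_attention_flow(C, Q).shape)   # (6, 32)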

Shortcut-Stacked Sentence Encoders for Multi-Domain Inference

This work presents a simple sequential sentence encoder based on stacked bidirectional LSTM-RNNs with shortcut connections and fine-tuning of word embeddings that achieves the new state-of-the-art encoding result on the original SNLI dataset.
...