How Does Selective Mechanism Improve Self-Attention Networks?

@article{Geng2020HowDS,
  title={How Does Selective Mechanism Improve Self-Attention Networks?},
  author={Xinwei Geng and Longyue Wang and Xing Wang and Bing Qin and Ting Liu and Zhaopeng Tu},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.00979}
}
Self-attention networks (SANs) with a selective mechanism have produced substantial improvements in various NLP tasks by concentrating on a subset of input words. However, the underlying reasons for their strong performance have not been well explained. In this paper, we bridge the gap by assessing the strengths of selective SANs (SSANs), which are implemented with a flexible and universal Gumbel-Softmax. Experimental results on several representative NLP tasks, including natural language…
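The abstract identifies Gumbel-Softmax as the device that makes the attention selection differentiable. As a minimal sketch of that idea only, not the authors' released implementation (the module name SelectiveAttention, the temperature value, and the straight-through hard flag are assumptions), a selective attention layer could replace the usual softmax weighting with a Gumbel-Softmax sample over the attention logits:

import torch
import torch.nn.functional as F

class SelectiveAttention(torch.nn.Module):
    # Sketch: scaled dot-product self-attention whose weights are drawn with
    # Gumbel-Softmax, so each query (softly) selects a subset of keys.
    def __init__(self, d_model: int, tau: float = 1.0, hard: bool = False):
        super().__init__()
        self.q_proj = torch.nn.Linear(d_model, d_model)
        self.k_proj = torch.nn.Linear(d_model, d_model)
        self.v_proj = torch.nn.Linear(d_model, d_model)
        self.tau = tau    # Gumbel-Softmax temperature (assumed value)
        self.hard = hard  # straight-through one-hot selection if True

    def forward(self, x):  # x: (batch, seq, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
        # Gumbel-Softmax keeps selection differentiable while letting the
        # distribution over keys become (near-)discrete.
        weights = F.gumbel_softmax(logits, tau=self.tau, hard=self.hard, dim=-1)
        return weights @ v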
Not All Attention Is All You Need
TLDR
This paper proposes a novel dropout method named AttendOut that makes self-attention-empowered pre-trained language models (PrLMs) capable of more robust task-specific tuning, and demonstrates that state-of-the-art models with elaborate training design can achieve much stronger results.
Effects of Similarity Score Functions in Attention Mechanisms on the Performance of Neural Question Answering Systems
TLDR
A baseline model is proposed that captures the common components of recurrent neural network-based Question Answering (QA) systems found in the literature; isolating the attention function makes it possible to study the effects of different similarity score functions on the performance of such systems.
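For orientation, the similarity score functions compared in such studies are typically the standard dot-product, bilinear, and additive forms. A brief illustrative sketch of those standard forms only (the exact set evaluated in that paper is not reproduced here):

import torch

def dot_score(q, k):
    # Dot-product similarity: score(q, k) = q . k
    return q @ k.transpose(-2, -1)

def bilinear_score(q, k, W):
    # Bilinear ("general") similarity with a learned matrix W: q W k^T
    return (q @ W) @ k.transpose(-2, -1)

def additive_score(q, k, Wq, Wk, v):
    # Additive (Bahdanau-style) similarity: v^T tanh(Wq q + Wk k),
    # broadcast so every query is scored against every key.
    hidden = torch.tanh((q @ Wq).unsqueeze(2) + (k @ Wk).unsqueeze(1))
    return hidden @ v  # (batch, q_len, k_len)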
Enhancing Attention Models via Multi-head Collaboration
  • Huadong Wang, Mei Tu
  • 2020 International Conference on Asian Language Processing (IALP)
  • 2020
TLDR
An empirical study and experimental results demonstrate the multi-head collaboration problem, and a simple but effective method is proposed to enhance the collaboration of different attention heads, allowing each head to rectify its attention scores with the help of the other heads.
Learning Hard Retrieval Cross Attention for Transformer
TLDR
The hard retrieval attention mechanism empirically accelerates scaled dot-product attention for both long and short sequences by 66.5%, while performing competitively on a wide range of machine translation tasks when used for cross-attention.
Learning Hard Retrieval Decoder Attention for Transformers
TLDR
An approach to learning hard retrieval attention, in which an attention head attends to only one token in the sentence rather than all tokens, is 1.43 times faster in decoding and preserves translation quality on a wide range of machine translation tasks when used in the decoder self- and cross-attention networks.
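The two entries above describe the same mechanism: each head picks a single key position by argmax instead of mixing all values with softmax weights. A rough sketch under that reading (the function name and shapes are assumptions, and the hard argmax below is not differentiable, so the papers' actual training procedure is not reproduced):

import torch

def hard_retrieval_attention(q, k, v):
    # q, k, v: (batch, seq, d). Each query gathers exactly one value,
    # chosen by argmax over the attention logits (no softmax mixing).
    logits = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (batch, q_len, k_len)
    index = logits.argmax(dim=-1)                           # hard selection per query
    index = index.unsqueeze(-1).expand(-1, -1, v.size(-1))  # (batch, q_len, d)
    return torch.gather(v, 1, index)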
Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling
TLDR
It is found that non-selective attention sharing is sub-optimal for achieving good generalization across all languages and domains, and attention-sharing strategies are further proposed to facilitate parameter sharing and specialization in multilingual and multi-domain sequence modeling.
Context-Aware Cross-Attention for Non-Autoregressive Translation
TLDR
Experimental results show that this approach consistently improves translation quality over strong NAT baselines, and extensive analyses demonstrate that the enhanced cross-attention achieves better exploitation of source contexts by leveraging both local and global information.
Rethinking the Value of Transformer Components
TLDR
This work evaluates the impact of individual components (sub-layers) in trained Transformer models from different perspectives and proposes a new training strategy that improves translation performance by distinguishing the unimportant components during training.
Learning to Refine Source Representations for Neural Machine Translation
TLDR
This work proposes a novel encoder-refiner-decoder framework, which dynamically refines the source representations based on the generated target-side information at each decoding step, and shows that the proposed approach significantly and consistently improves translation performance over the standard encoding framework.
...
...

References

SHOWING 1-10 OF 46 REFERENCES
Assessing the Ability of Self-Attention Networks to Learn Word Order
TLDR
Experimental results reveal that: 1) a SAN trained on word reordering detection indeed has difficulty learning positional information even with position embedding; and 2) a SAN trained on machine translation learns better positional information than its RNN counterpart, in which position embedding plays a critical role.
Self-Attention with Structural Position Representations
TLDR
This work uses a dependency tree to represent the grammatical structure of a sentence and proposes two strategies to encode the positional relationships among words in the dependency tree.
Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures
TLDR
The experimental results show that: 1) self-attentional networks and CNNs do not outperform RNNs in modeling subject-verb agreement over long distances; and 2) self-attentional networks perform distinctly better than RNNs and CNNs on word sense disambiguation.
Towards Better Modeling Hierarchical Structure for Self-Attention with Ordered Neurons
TLDR
This work proposes to further enhance the strength of hybrid models with an advanced variant of RNNs – Ordered Neurons LSTM (ON-LSTM), which introduces a syntax-oriented inductive bias to perform tree-like composition.
Convolutional Self-Attention Networks
TLDR
Novel convolutional self-attention networks are proposed, which offer SANs the abilities to strengthen dependencies among neighboring elements, and model the interaction between features extracted by multiple attention heads.
Effective Approaches to Attention-based Neural Machine Translation
TLDR
A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
Phrase-level Self-Attention Networks for Universal Sentence Encoding
TLDR
Phrase-level Self-Attention Networks (PSAN) are proposed that perform self-attention across words inside a phrase to capture context dependencies at the phrase level, and use a gated memory updating mechanism to hierarchically refine each word's representation with longer-term context dependencies captured in larger phrases.
Modeling Localness for Self-Attention Networks
TLDR
This work casts localness modeling as a learnable Gaussian bias, which indicates the center and scope of the local region to receive more attention in self-attention networks, maintaining the strength of capturing long-distance dependencies while enhancing the ability to capture short-range dependencies.
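In the cited work the Gaussian bias is added to the attention logits before the softmax, so keys near a predicted center receive more weight. A rough sketch of that idea (how the center and scope are predicted is simplified here, with both passed in as arguments rather than using the paper's exact parameterization):

import torch

def localness_biased_attention(q, k, v, center, sigma):
    # q, k, v : (batch, seq, d); center, sigma : (batch, seq), the predicted
    # central position and scope (window width) of the local region per query.
    logits = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)        # (batch, seq, seq)
    positions = torch.arange(k.size(1), device=k.device).float()  # key indices
    # Gaussian penalty: keys far from each query's center are suppressed.
    bias = -((positions[None, None, :] - center.unsqueeze(-1)) ** 2) / (2.0 * sigma.unsqueeze(-1) ** 2)
    weights = torch.softmax(logits + bias, dim=-1)
    return weights @ v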
Deep Semantic Role Labeling with Self-Attention
TLDR
This paper presents a simple and effective architecture for SRL based on self-attention, which can directly capture the relationship between two tokens regardless of their distance and is computationally efficient.
Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling
TLDR
An RNN/CNN-free sentence-encoding model, the reinforced self-attention network (ReSAN), built solely on ReSA, is proposed; it achieves state-of-the-art performance on both the Stanford Natural Language Inference and the Sentences Involving Compositional Knowledge datasets.
...
...