Location Attention for Extrapolation to Longer Sequences

@article{Dubois2020LocationAF,
  title={Location Attention for Extrapolation to Longer Sequences},
  author={Yann Dubois and Gautier Dagan and Dieuwke Hupkes and Elia Bruni},
  journal={ArXiv},
  year={2020},
  volume={abs/1911.03872}
}
Neural networks are surprisingly good at interpolating and perform remarkably well when the training set examples resemble those in the test set. However, they are often unable to extrapolate patterns beyond the seen data, even when the abstractions required for such patterns are simple. In this paper, we first review the notion of extrapolation, why it is important and how one could hope to tackle it. We then focus on a specific type of extrapolation which is especially useful for natural…
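To make the interpolation/extrapolation distinction concrete, here is a minimal Python sketch of a length-based extrapolation split on a toy copy task; the task, the names (make_copy_example, MAX_TRAIN_LEN, MAX_TEST_LEN) and the length cut-offs are illustrative assumptions, not taken from the paper.

import random

# Toy copy task: the model must reproduce the input sequence verbatim.
VOCAB = list("abcdefgh")
MAX_TRAIN_LEN = 10   # lengths seen during training (interpolation regime)
MAX_TEST_LEN = 20    # strictly longer lengths held out for extrapolation

def make_copy_example(length):
    """Return a (source, target) pair where the target is a copy of the source."""
    seq = [random.choice(VOCAB) for _ in range(length)]
    return seq, list(seq)

# Training data contains only short sequences.
train = [make_copy_example(random.randint(1, MAX_TRAIN_LEN)) for _ in range(1000)]

# The extrapolation test set contains only sequences longer than any seen in training.
test_extrapolation = [make_copy_example(random.randint(MAX_TRAIN_LEN + 1, MAX_TEST_LEN))
                      for _ in range(200)]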
Citations

The EOS Decision and Length Extrapolation
TLDR
It is found that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket closing task, as well as achieving a 40% improvement over +EOS in the difficult SCAN dataset length generalization task.
Test Harder than You Train: Probing with Extrapolation Splits
Previous work on probing word representations for linguistic knowledge has focused on interpolation tasks. In this paper, we instead analyse probes in an extrapolation setting, where the inputs at…
∞-former: Infinite Memory Transformer
TLDR
The ∞-former is proposed, which extends the vanilla transformer with an unbounded long-term memory and is able to model arbitrarily long contexts and maintain “sticky memories” while keeping a fixed computation budget.
Compositionality Decomposed: How do Neural Networks Generalise?
TLDR
A set of tests is presented that provides a bridge between the vast amount of linguistic and philosophical theory about the compositionality of language and the successful neural models of language; the tests uncover the strengths and weaknesses of three popular architectures and point to potential areas of improvement.
The compositionality of neural networks: integrating symbolism and connectionism
TLDR
A set of tests is presented that provides a bridge between the vast amount of linguistic and philosophical theory about compositionality and the successful neural models of language; the resulting tests are applied to three popular sequence-to-sequence models.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization
TLDR
This novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task, as well as near-perfect accuracy on the simple arithmetic task and a new variant of ListOps testing for generalization across computational depth.
Measuring Systematic Generalization in Neural Proof Generation with Transformers
TLDR
It is observed that models that are not trained to generate proofs are better at generalizing to problems based on longer proofs, which suggests that Transformers have efficient internal reasoning strategies that are harder to interpret.
How BPE Affects Memorization in Transformers
TLDR
It is demonstrated that the size of the subword vocabulary learned by Byte-Pair Encoding greatly affects both the ability and the tendency of standard Transformer models to memorize training data, even when controlling for the number of learned parameters.
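For context on the mechanism being varied, here is a minimal Python sketch of the Byte-Pair Encoding merge loop in the style of Sennrich et al.; it is not the tokenizer used in the cited paper, and the toy corpus and num_merges value are illustrative. The number of merges is what determines the subword vocabulary size discussed above.

import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {word-as-space-separated-symbols: freq} dict."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every whitespace-bounded occurrence of the given symbol pair."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus; '</w>' marks word ends.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
num_merges = 10  # more merges -> larger subword vocabulary
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    vocab = merge_pair(max(pairs, key=pairs.get), vocab)
print(vocab)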

References

SHOWING 1-10 OF 42 REFERENCES
The Fine Line between Linguistic Generalization and Failure in Seq2Seq-Attention Models
TLDR
It is found that a model's ability to generalize on a simple symbol rewriting task with a clearly defined structure depends greatly on the chosen random seed, even when performance on the standard test set remains the same.
Memorize or generalize? Searching for a compositional RNN in a haystack
TLDR
This paper proposes the lookup table composition domain as a simple setup to test compositional behaviour and shows that it is theoretically possible for a standard RNN to learn to behave compositionally in this domain when trained with standard gradient descent and provided with additional supervision.
Learning compositionally through attentive guidance
TLDR
Attentive Guidance, a mechanism to direct a sequence-to-sequence model equipped with attention to find more compositional solutions, is introduced, and it is shown that vanilla sequence-to-sequence models with attention overfit the training distribution, while the guided versions come up with compositional solutions that fit the training and testing distributions almost equally well.
End-To-End Memory Networks
TLDR
A neural network with a recurrent attention model over a possibly large external memory is presented; it is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.
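As a rough illustration of attention over an external memory, the following is a minimal NumPy sketch of a single memory "hop" in the spirit of End-To-End Memory Networks; the shapes and random vectors are placeholders, since real models learn the memory and query embeddings.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

d, n_slots = 16, 8
rng = np.random.default_rng(0)
memory_in = rng.normal(size=(n_slots, d))    # input memory representations m_i
memory_out = rng.normal(size=(n_slots, d))   # output memory representations c_i
query = rng.normal(size=d)                   # question embedding u

p = softmax(memory_in @ query)               # p_i = softmax(u . m_i): attention over slots
o = p @ memory_out                           # o = sum_i p_i c_i: read from memory
next_query = query + o                       # fed to the next hop or the answer layer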
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
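As a pointer to what "based solely on attention mechanisms" means, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of the Transformer; learned projections, multi-head splitting, masking and positional encodings are deliberately omitted.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)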
Sequence to Sequence Learning with Neural Networks
TLDR
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences markedly improved the LSTM's performance, because doing so introduced many short-term dependencies between the source and the target sentence, which made the optimization problem easier.
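The source-reversal trick can be shown in a few lines (not the paper's code, and the toy sentence pair is made up): reversing the source places the earliest source words next to the earliest target words the decoder must emit, shortening those dependencies.

# Illustrative only: a toy sentence pair.
source = ["I", "like", "tea"]
target = ["j'", "aime", "le", "thé"]
encoder_input = list(reversed(source))  # ["tea", "like", "I"]
# The encoder now ends on "I", immediately before the decoder must produce "j'".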
Measuring abstract reasoning in neural networks
TLDR
A dataset and challenge designed to probe abstract reasoning, inspired by a well-known human IQ test, are proposed, and ways to both measure and induce stronger abstract reasoning in neural networks are introduced.
Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks
TLDR
This paper introduces the SCAN domain, consisting of a set of simple compositional navigation commands paired with the corresponding action sequences, and tests the zero-shot generalization capabilities of a variety of recurrent neural networks trained on SCAN with sequence-to-sequence methods.
Self-Attention with Relative Position Representations
TLDR
This work presents an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements, on the WMT 2014 English-to-German and English-to-French translation tasks.
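A minimal NumPy sketch of key-side relative position representations follows, loosely following Shaw et al.'s formulation; for brevity the query/key/value projections are folded into the raw token vectors, and the clipping distance is an arbitrary choice.

import numpy as np

def relative_self_attention(X, rel_emb, max_dist):
    """X: (n, d) token vectors; rel_emb: (2*max_dist+1, d) relative-position embedding table."""
    n, d = X.shape
    logits = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            rel = int(np.clip(j - i, -max_dist, max_dist)) + max_dist  # clipped distance -> index
            logits[i, j] = X[i] @ (X[j] + rel_emb[rel]) / np.sqrt(d)   # key gets a relative term
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                     # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
rel_emb = rng.normal(size=(2 * 4 + 1, 8))  # distances clipped to [-4, 4]
print(relative_self_attention(X, rel_emb, max_dist=4).shape)  # (5, 8)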