This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.
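The error-compensation idea described above can be sketched in a few lines: coordinates dropped by the compressor are kept in a local memory and re-added to the next gradient, so nothing is lost, only delayed. This is a minimal NumPy illustration of the mechanism, not the paper's implementation; the function names and the simple quadratic used below are illustrative.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude coordinates of v, zero out the rest."""
    idx = np.argsort(np.abs(v))[-k:]
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

def ef_sgd_step(w, grad, memory, lr=0.1, k=1):
    """One step of error-compensated (error-feedback) sparsified SGD.

    The residual dropped by top_k is stored in `memory` and added back
    to the next scaled gradient, so every coordinate is eventually applied.
    """
    corrected = lr * grad + memory     # re-inject previously dropped mass
    update = top_k(corrected, k)       # transmit/apply only k coordinates
    memory = corrected - update        # remember what was dropped
    return w - update, memory

# Illustrative usage: minimize f(w) = 0.5 * ||w||^2, whose gradient is w.
w, memory = np.array([1.0, 2.0, 3.0]), np.zeros(3)
for _ in range(300):
    w, memory = ef_sgd_step(w, w, memory, lr=0.1, k=1)
```

Despite applying only one coordinate per step, the memory term makes the iterates track full SGD, which is the intuition behind the matching convergence rate.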

This work proves that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer, which provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.

A sparsification scheme for SGD where only a small constant number of coordinates are applied at each iteration, which outperforms QSGD in progress per number of bits sent and opens the path to using lock-free asynchronous parallelization on dense problems.

This work proposes a new way to understand self-attention networks: it is shown that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers, and proves that self-attention possesses a strong inductive bias towards “token uniformity”.

This work proposes a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word and sentence representations, significantly improves cross-lingual sentence retrieval performance over all other approaches, and maintains parity with the current state-of-the-art methods on word translation.

We provide a general self-attention formulation to impose group equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings that are invariant to the action of the…

A collaborative multi-head attention layer that enables heads to learn shared projections, reduces the computational cost and number of parameters of an attention layer, and can be used as a drop-in replacement in any transformer architecture.
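The sharing scheme can be illustrated as follows: instead of each head owning its own query/key projections, all heads share one projection pair and differ only through a small per-head mixing vector over the shared dimensions. This is a minimal NumPy sketch under that assumption, not the paper's actual parameterization; all names are illustrative.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def collaborative_attention(X, Wq, Wk, mix, Wv_heads):
    """Attention where heads share Wq/Wk and differ via `mix`.

    X:        (n, d) token embeddings
    Wq, Wk:   (d, d_shared) query/key projections shared by all heads
    mix:      (heads, d_shared) per-head reweighting of the shared dims
    Wv_heads: (heads, d, d_v) per-head value projections
    """
    Q, K = X @ Wq, X @ Wk
    outs = []
    for h in range(mix.shape[0]):
        # Each head attends with a reweighted view of the shared Q/K space.
        scores = (Q * mix[h]) @ K.T / np.sqrt(Q.shape[1])
        A = softmax(scores)
        outs.append(A @ (X @ Wv_heads[h]))
    return np.concatenate(outs, axis=-1)
```

Because Wq and Wk are stored once rather than per head, the query/key parameter count no longer grows with the number of heads, which is the source of the savings the summary refers to.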

A graph neural network that, conditioned on a path prefix, can efficiently extrapolate path suffixes, evaluate path likelihood, and sample from the future path distribution, and is able to adapt to graphs with very different properties.

This work proposes a method based on a differentiable Top-K operator to select the most relevant parts of the input to efficiently process high resolution images and shows results for traffic sign recognition, inter-patch relationship reasoning, and fine-grained recognition without using object/part bounding box annotations during training.
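One standard way to make top-k differentiable, which conveys the idea behind such an operator, is to smooth the hard selection by averaging it over random perturbations of the scores: the expectation of the perturbed hard top-k indicator is a smooth function of the scores. The sketch below is a forward-pass Monte-Carlo estimate of that smoothed indicator in NumPy, not the paper's implementation; names and parameters are illustrative, and the gradient would come from a perturbed-optimizer estimator rather than autodiff through `argsort`.

```python
import numpy as np

def perturbed_topk_indicator(scores, k, sigma=0.5, n_samples=200, seed=0):
    """Monte-Carlo estimate of E[hard_topk(scores + sigma * noise)].

    Each sample selects a hard top-k under Gaussian noise; averaging the
    0/1 indicators yields a soft, smoothed selection in [0, 1].
    """
    rng = np.random.default_rng(seed)
    d = len(scores)
    acc = np.zeros(d)
    for _ in range(n_samples):
        noisy = scores + sigma * rng.standard_normal(d)
        idx = np.argsort(noisy)[-k:]     # hard top-k under this noise draw
        hard = np.zeros(d)
        hard[idx] = 1.0
        acc += hard
    return acc / n_samples
```

In a patch-selection setting, the entries of `scores` would be relevance scores for image patches, and the soft indicator weights which patches are passed downstream.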

The supplementary material consists of the following: performance trade-offs associated with patch sampling versus running a CNN on the entire high resolution image (Appendix A), some theoretical…