Publications
Sparsified SGD with Memory
This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.
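The core idea can be sketched in a few lines: compress each update with a top-k operator, but keep the discarded coordinates in a memory buffer that is added back before the next compression. The snippet below is a minimal NumPy illustration on a least-squares problem; all names (`top_k`, `memory`, the problem sizes) are illustrative, not the paper's code.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(0)
d, lr, k = 50, 0.1, 5
A = rng.normal(size=(200, d))
x_true = rng.normal(size=d)
b = A @ x_true

x = np.zeros(d)
memory = np.zeros(d)                        # error accumulator ("memory")
for step in range(2000):
    grad = A.T @ (A @ x - b) / len(b)       # full-batch least-squares gradient
    update = top_k(memory + lr * grad, k)   # compress the corrected update
    memory += lr * grad - update            # remember what was dropped
    x -= update

print(np.linalg.norm(x - x_true))           # converges despite sending only k of d coords
```

Without the `memory` term, small but persistent coordinates would never be transmitted; the error compensation guarantees they eventually are, which is what recovers the vanilla-SGD rate.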
On the Relationship between Self-Attention and Convolutional Layers
This work proves that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer, which provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice.
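The constructive direction of the proof can be illustrated with a toy 1-D example: give each head purely positional scores that put (almost) all attention mass on the token at one fixed relative offset, so each head outputs a shifted copy of the values, and the sum over heads is exactly a convolution. This NumPy sketch (circular padding, hypothetical shapes) checks the equivalence numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 12, 4
X = rng.normal(size=(T, d))                        # a 1-D "image" of T tokens

offsets = [-1, 0, 1]                               # one head per filter tap
W_v = [rng.normal(size=(d, d)) for _ in offsets]   # per-head value projections

out_attn = np.zeros((T, d))
for delta, Wv in zip(offsets, W_v):
    scores = np.full((T, T), -30.0)                # purely positional scores
    for i in range(T):
        scores[i, (i + delta) % T] = 30.0          # sharply peaked at one offset
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)              # softmax -> near one-hot rows
    out_attn += A @ X @ Wv                         # head output = shifted values

# The same computation written as an explicit circular 1-D convolution
out_conv = np.zeros((T, d))
for i in range(T):
    for delta, Wv in zip(offsets, W_v):
        out_conv[i] += X[(i + delta) % T] @ Wv

print(np.allclose(out_attn, out_conv))  # True
```

The paper's actual construction uses relative positional encodings in 2-D and quadratic attention patterns, but the mechanism is the same: heads specialize to fixed pixel shifts.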
Convex Optimization using Sparsified Stochastic Gradient Descent with Memory
A sparsification scheme for SGD where only a small constant number of coordinates are applied at each iteration, which outperforms QSGD in progress per number of bits sent and opens the path to using lock-free asynchronous parallelization on dense problems.
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
This work proposes a new way to understand self-attention networks: it is shown that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers, and it proves that self-attention possesses a strong inductive bias towards “token uniformity”.
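The "token uniformity" bias is easy to observe empirically: stack pure self-attention layers with random weights (no skip connections, no MLPs) and the rows of the output converge towards a common vector, i.e. the representation collapses towards rank one. A minimal NumPy demonstration, with all shapes and scalings chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16
X = rng.normal(size=(T, d))

def softmax_rows(S):
    A = np.exp(S - S.max(axis=1, keepdims=True))
    return A / A.sum(axis=1, keepdims=True)

def uniformity_residual(X):
    # relative distance from X to the nearest matrix with all rows equal
    return np.linalg.norm(X - X.mean(axis=0)) / np.linalg.norm(X)

start = uniformity_residual(X)
for _ in range(12):                          # a stack of pure attention layers
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    A = softmax_rows((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
    X = A @ X @ Wv                           # no skip connection, no MLP
end = uniformity_residual(X)

print(start, end)                            # the residual shrinks rapidly
```

The paper quantifies this: the residual of each path term decays doubly exponentially with depth, and skip connections are what prevent the collapse in real transformers.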
Robust Cross-lingual Embeddings from Parallel Sentences
This work proposes a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word and sentence representations, and significantly improves cross-lingual sentence retrieval performance over all other approaches while maintaining parity with the current state-of-the-art methods on word translation.
Group Equivariant Stand-Alone Self-Attention For Vision
We provide a general self-attention formulation to impose group equivariance to arbitrary symmetry groups. This is achieved by defining positional encodings that are invariant to the action of the …
Multi-Head Attention: Collaborate Instead of Concatenate
A collaborative multi-head attention layer that enables heads to learn shared projections, reducing the computational cost and parameter count of an attention layer, and that can be used as a drop-in replacement in any transformer architecture.
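The sharing idea can be sketched as follows: instead of each head owning its own query/key projections, all heads share one projection pair and differ only through a small per-head mixing vector that reweights the shared dimensions. This NumPy sketch uses hypothetical shapes and names (`M` is the mixing matrix) and is not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, H, dk = 6, 16, 4, 8        # tokens, model dim, heads, shared key dim
X = rng.normal(size=(T, d))

Wq = rng.normal(size=(d, dk)) / np.sqrt(d)   # ONE query projection, shared by all heads
Wk = rng.normal(size=(d, dk)) / np.sqrt(d)   # ONE key projection, shared by all heads
Wv = rng.normal(size=(d, d)) / np.sqrt(d)    # value projection
M = rng.normal(size=(H, dk))                 # per-head mixing vectors

Q, K, V = X @ Wq, X @ Wk, X @ Wv
heads = []
for h in range(H):
    S = (Q * M[h]) @ K.T / np.sqrt(dk)       # head h reweights the shared dims
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    heads.append(A @ V)
out = np.mean(heads, axis=0)

print(out.shape)  # (6, 16)
```

The payoff is parameter efficiency: H independent projections cost H times the shared pair, while the mixing vectors add only H * dk extra parameters.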
Extrapolating paths with graph neural networks
A graph neural network that, conditioned on a path prefix, can efficiently extrapolate path suffixes, evaluate path likelihood, and sample from the future path distribution, and that is able to adapt to graphs with very different properties.
Differentiable Patch Selection for Image Recognition
This work proposes a method based on a differentiable Top-K operator to select the most relevant parts of the input to efficiently process high-resolution images, and shows results for traffic sign recognition, inter-patch relationship reasoning, and fine-grained recognition without using object/part bounding box annotations during training.
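The hard top-k mask is piecewise constant, so its gradient is zero almost everywhere; a standard way to make it differentiable (the family of perturbed optimizers the paper builds on) is to take the expectation of the hard mask under Gaussian perturbations of the scores, which is smooth in the scores. A forward-pass-only Monte Carlo sketch, with illustrative names and values:

```python
import numpy as np

def hard_topk_mask(s, k):
    """0/1 indicator of the k largest entries of s."""
    m = np.zeros_like(s)
    m[np.argsort(s)[-k:]] = 1.0
    return m

def perturbed_topk_mask(s, k, sigma=0.5, n_samples=2000, seed=0):
    """Monte Carlo estimate of E[hard top-k mask of s + sigma * noise].

    Unlike the piecewise-constant hard mask, this expectation varies
    smoothly with s, so it admits useful gradients."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n_samples, s.size))
    return np.mean([hard_topk_mask(s + sigma * z, k) for z in Z], axis=0)

scores = np.array([3.0, 1.0, 0.5, 2.9])
hard = hard_topk_mask(scores, 2)          # [1. 0. 0. 1.]
soft = perturbed_topk_mask(scores, 2)     # soft mask concentrated on entries 0 and 3
print(hard, soft)
```

In the paper this relaxation is what lets patch-selection scores be trained end-to-end with the downstream recognition loss.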
Supplement: Differentiable Patch Selection for Image Recognition
The supplementary material consists of the following: performance trade-offs associated with patch sampling versus running a CNN on the entire high-resolution image (Appendix A), some theoretical …