The Right Tool for the Job: Matching Model and Instance Complexities
TLDR
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances, and late (and accurate) exit for hard instances.
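As an illustration of this early-exit mechanism (a minimal sketch, not the paper's released code), the snippet below attaches a small classifier to every encoder layer and, at inference time, stops at the first layer whose softmax confidence clears a threshold. All sizes, names, and the confidence rule are assumptions.

```python
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Toy encoder with a classifier ("off-ramp") after every layer."""

    def __init__(self, d_model=256, n_heads=4, n_layers=6, n_classes=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # One lightweight exit classifier per layer (hypothetical design).
        self.exits = nn.ModuleList(
            nn.Linear(d_model, n_classes) for _ in range(n_layers)
        )

    @torch.no_grad()
    def infer(self, x, threshold=0.9):
        """Return (prediction, exit_layer) for a single instance (batch of 1)."""
        h = x
        for i, (layer, exit_head) in enumerate(zip(self.layers, self.exits)):
            h = layer(h)
            probs = exit_head(h.mean(dim=1)).softmax(dim=-1)  # pool over tokens
            conf, pred = probs.max(dim=-1)
            if conf.item() >= threshold:   # confident enough: exit early
                return pred.item(), i
        return pred.item(), len(self.layers) - 1  # hard instance: use every layer
```

Lowering the threshold exits earlier (faster but riskier); raising it defers more instances to the deeper, slower exits.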
Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation
TLDR
We re-examine the trade-off and argue that transformer-based autoregressive models can be substantially sped up without loss in accuracy.
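Concretely, the reallocation amounts to something like the configuration below: almost all layers go into the encoder, which runs once per source sentence, while the autoregressive decoder, which runs once per generated token, is kept to a single layer. The 12-1 split and hyperparameters here are illustrative, not necessarily the paper's exact setup.

```python
import torch.nn as nn

D_MODEL, N_HEADS = 512, 8

# Deep encoder: most of the capacity lives here and is applied once,
# in parallel over the whole source sentence.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_MODEL, N_HEADS, batch_first=True),
    num_layers=12,
)

# Shallow decoder: only one layer runs at every autoregressive step,
# so the sequential part of inference stays cheap.
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(D_MODEL, N_HEADS, batch_first=True),
    num_layers=1,
)
```

Because the encoder's cost is paid once per sentence and the decoder's cost is paid once per output token, shifting layers from decoder to encoder speeds up generation without shrinking the model.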
Improving Transformer Models by Reordering their Sublayers
TLDR
We propose a new transformer stack, the sandwich transformer, and show that it improves perplexity on multiple word-level and character-level language modeling benchmarks, at no cost in parameters, memory, or training time.
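The reordering itself is easy to state: writing a self-attention sublayer as s and a feedforward sublayer as f, a standard n-layer stack is the interleaved pattern (sf)^n, while a sandwich transformer with coefficient k uses s^k (sf)^(n-k) f^k, i.e. extra attention near the bottom and extra feedforward near the top. A tiny helper that generates the pattern (an illustration, not the authors' code):

```python
def sandwich_ordering(n: int, k: int) -> str:
    """Sublayer pattern for a sandwich transformer.

    's' = self-attention sublayer, 'f' = feedforward sublayer.
    k = 0 recovers the standard interleaved stack (sf)^n.
    """
    assert 0 <= k <= n
    return "s" * k + "sf" * (n - k) + "f" * k

print(sandwich_ordering(4, 0))  # sfsfsfsf  (baseline interleaving)
print(sandwich_ordering(4, 2))  # sssfsfff  (sandwich with k = 2)
```

Both orderings contain exactly the same sublayers, which is why the swap costs nothing in parameters, memory, or training time.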
Multi-View Learning for Vision-and-Language Navigation
TLDR
We present a novel training paradigm, Learn from EveryOne (LEO), which leverages multiple instructions (as different views) for the same trajectory to resolve language ambiguity and improve generalization.
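One plausible reading of the multi-view idea, written as a sketch rather than the released LEO code: every instruction that annotators wrote for the same trajectory is treated as a view, and the agent's loss is averaged over those views so it cannot latch onto a single phrasing. The agent interface and the plain averaging below are my assumptions.

```python
import torch

def multi_view_loss(agent_loss, trajectory, instructions):
    """Average the loss over all instructions (views) of one trajectory.

    `agent_loss(trajectory, instruction)` is a hypothetical callable that
    returns a scalar loss tensor for one (trajectory, instruction) pair.
    """
    losses = [agent_loss(trajectory, ins) for ins in instructions]
    return torch.stack(losses).mean()
```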
Shortformer: Better Language Modeling using Shorter Inputs
TLDR
We explore the benefits of decreasing the input length of transformers.
End-to-End Neural Segmental Models for Speech Recognition
TLDR
We study end-to-end segmental models with different weight functions, including ones based on frame-level neural classifiers and on segmental recurrent neural networks.
A Formal Hierarchy of RNN Architectures
TLDR
We develop a formal hierarchy of the expressive capacity of RNN architectures based on two formal properties: space complexity, which measures the RNN's memory, and rational recurrence, defined as whether the recurrent update can be described by a weighted finite-state machine.
A Mixture of h-1 Heads is Better than h Heads
Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads …
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs
TLDR
We present Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs.
Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings
TLDR
We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training.
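The retrieval step this describes can be sketched roughly as follows: mean-pool multilingual BERT's final hidden states into sentence vectors, then pair every source sentence with its nearest target sentence by cosine similarity. The checkpoint, the pooling, and the simple thresholded scoring below are assumptions for illustration; more refined scoring (e.g., margin-based) and the self-training loop that re-adapts the encoder on mined pairs are omitted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical setup: any multilingual encoder checkpoint could be used here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

@torch.no_grad()
def embed(sentences):
    """Mean-pooled, unit-normalized sentence embeddings (one pooling choice)."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state        # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.nn.functional.normalize(pooled, dim=-1)

def mine_pairs(src_sentences, tgt_sentences, threshold=0.9):
    """Keep (source, nearest target) pairs whose cosine similarity clears a threshold."""
    sims = embed(src_sentences) @ embed(tgt_sentences).T  # cosine, since unit-normed
    scores, idx = sims.max(dim=1)
    return [
        (src, tgt_sentences[j], score)
        for src, j, score in zip(src_sentences, idx.tolist(), scores.tolist())
        if score >= threshold
    ]
```

In the full method, pairs mined this way would feed a self-training loop that fine-tunes the encoder and then repeats the mining.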