• Publications
  • Influence
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
TLDR
In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. Expand
  • 1,426
  • 354
  • PDF
SPICE: Semantic Propositional Image Caption Evaluation
TLDR
We propose a new automated caption evaluation metric that captures human judgments over model-generated captions better than other automatic metrics. Expand
  • 560
  • 183
  • PDF
Decomposing a scene into geometric and semantically consistent regions
TLDR
We propose a region-based model which combines appearance and scene geometry to automatically decompose a scene into semantically meaningful regions. Expand
  • 682
  • 92
  • PDF
Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments
TLDR
A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. Expand
  • 355
  • 84
  • PDF
Bottom-Up and Top-Down Attention for Image Captioning and VQA
TLDR
We propose a combined bottom-up and topdown visual attention mechanism that enables attention to be calculated at the level of objects and other salient image regions, while the top-down mechanism determines feature weightings. Expand
  • 247
  • 78
Dynamic Image Networks for Action Recognition
TLDR
We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis especially when convolutional neural networks (CNNs) are used. Expand
  • 402
  • 65
  • PDF
Multi-Class Segmentation with Relative Location Prior
TLDR
In this paper, we improve on state-of-the-art multiclass image segmentation labeling techniques by using contextual information that captures spatial relationships between classes. Expand
  • 401
  • 37
  • PDF
Single image depth estimation from predicted semantic labels
TLDR
We consider the problem of estimating the depth of each pixel in a scene from a single monocular image. Expand
  • 411
  • 26
  • PDF
Projected Subgradient Methods for Learning Sparse Gaussians
TLDR
We present a new algorithm for optimizing the ℓ1-penalized log-likelihood objective for sparse GMRFs in a high-dimensional space. Expand
  • 150
  • 20
  • PDF
Self-Supervised Video Representation Learning with Odd-One-Out Networks
TLDR
We propose a new self-supervised CNN pre-training technique based on a novel auxiliary task called odd-one-out learning, which generalizes to other related tasks such as action recognition. Expand
  • 214
  • 17
  • PDF
...
1
2
3
4
5
...