Publications
Long-term recurrent convolutional networks for visual recognition and description
TLDR: We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges.
  • 3,251 citations (413 highly influential)
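The recurrent-convolutional pattern summarized above (per-frame CNN features fed through an LSTM, trained end to end) can be sketched compactly. Below is a minimal, hypothetical PyTorch sketch of that general pattern, not the paper's actual implementation; the backbone, dimensions, and class names are illustrative assumptions.

```python
# Hypothetical sketch of the recurrent-convolutional pattern: a CNN encodes
# each video frame, an LSTM integrates the sequence over time, and a linear
# head predicts an activity label. Names and sizes are illustrative only.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class RecurrentConvNet(nn.Module):
    def __init__(self, num_classes: int, hidden: int = 256):
        super().__init__()
        cnn = resnet18(weights=None)
        cnn.fc = nn.Identity()              # keep 512-d per-frame features
        self.cnn = cnn
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)  # CNN per frame
        out, _ = self.lstm(feats)           # integrate features over time
        return self.head(out[:, -1])        # classify from the final step

model = RecurrentConvNet(num_classes=101)
logits = model(torch.randn(2, 8, 3, 224, 224))  # two clips of 8 frames each
```

Because the CNN sits inside the module, gradients flow from the sequence loss back into the visual features, which is what makes the architecture end-to-end trainable.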
Localizing Moments in Video with Natural Language
TLDR: We propose the Moment Context Network (MCN), which effectively localizes natural language queries in videos by integrating local and global video features over time.
  • 164 citations (43 highly influential)
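The core retrieval idea, embedding a sentence and candidate video moments (local segment features combined with global video context) into a shared space and ranking moments by distance, can be illustrated with a short sketch. The following is a hypothetical simplification, not the paper's exact model; the encoders, dimensions, and names are assumptions.

```python
# Hypothetical sketch of moment localization by joint embedding, in the
# spirit of MCN: score each candidate moment against the language query in
# a shared space. Module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class MomentScorer(nn.Module):
    def __init__(self, video_dim=512, query_dim=300, embed_dim=256):
        super().__init__()
        # a moment is represented by local segment features plus a
        # global (whole-video) context feature, concatenated
        self.video_proj = nn.Linear(2 * video_dim, embed_dim)
        self.query_proj = nn.Linear(query_dim, embed_dim)

    def forward(self, local_feats, global_feat, query_feat):
        # local_feats: (num_moments, video_dim); global_feat: (video_dim,)
        ctx = global_feat.expand_as(local_feats)
        moments = self.video_proj(torch.cat([local_feats, ctx], dim=-1))
        query = self.query_proj(query_feat)
        # smaller squared distance = better match; return best moment index
        dists = ((moments - query) ** 2).sum(dim=-1)
        return dists.argmin()

scorer = MomentScorer()
best = scorer(torch.randn(10, 512), torch.randn(512), torch.randn(300))
```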
Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data
TLDR: In this work, we propose the Deep Compositional Captioner (DCC) to address the task of generating descriptions of novel objects which are not present in paired image-sentence datasets.
  • 171 citations (27 highly influential)
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
TLDR: We propose a new model which can jointly generate visual and textual explanations, using an attention mask to localize salient regions when generating textual rationales.
  • 137 citations (19 highly influential)
Captioning Images with Diverse Objects
TLDR: We propose the Novel Object Captioner (NOC), a deep visual semantic captioning model that can describe a large number of object categories not present in existing image-caption datasets.
  • 91 citations (14 highly influential)
Generating Visual Explanations
TLDR: We propose a new model that focuses on the discriminating properties of the visible object, jointly predicts a class label, and explains why the predicted label is appropriate for the image.
  • 308 citations (13 highly influential)
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training
TLDR: We change the training objective of the caption generator from reproducing ground-truth captions to generating a set of captions that is indistinguishable from human-written captions.
  • 63 citations (12 highly influential)
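The shift in objective described above is essentially adversarial: a discriminator learns to tell human captions from generated ones, and the generator is trained to fool it rather than to reproduce ground truth. Here is a minimal, hypothetical sketch of those two losses; it is a simplified illustration (for instance, it omits the relaxation needed to backpropagate through discrete caption sampling), and the function names are assumptions.

```python
# Hypothetical sketch of an adversarial captioning objective: the
# discriminator separates human from generated captions, while the
# generator is rewarded for captions the discriminator accepts as human.
import torch
import torch.nn.functional as F

def discriminator_loss(d_human: torch.Tensor, d_fake: torch.Tensor):
    # d_*: discriminator logits for human vs. generated captions
    real = F.binary_cross_entropy_with_logits(d_human, torch.ones_like(d_human))
    fake = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return real + fake

def generator_loss(d_fake: torch.Tensor):
    # instead of maximum likelihood on ground-truth captions, push the
    # generated captions toward being classified as human-written
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
```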
Deep learning for tactile understanding from visual and haptic data
TLDR: Robots which interact with the physical world will benefit from a fine-grained tactile understanding of objects and surfaces.
  • 132 citations (10 highly influential)
Multimodal Video Description
TLDR: We based our multimodal video description network on the state-of-the-art sequence-to-sequence video-to-text (S2VT) model and extended it to take advantage of multiple modalities.
  • 85 citations (8 highly influential)