• Publications
  • Influence
Peeking Into the Future: Predicting Future Person Activities and Locations in Videos
TLDR
We propose an end-to-end, multi-task learning system utilizing rich visual features about human behavioral information and interaction with their surroundings. Expand
  • 88
  • 15
  • PDF
Focal Visual-Text Attention for Visual Question Answering
TLDR
We propose a novel neural network called Focal Visual-Text Attention network (FVTA) for collective reasoning in visual question answering, where both visual and text sequence information such as images and text metadata are presented. Expand
  • 61
  • 3
  • PDF
MemexQA: Visual Memex Question Answering
TLDR
This paper proposes a new task, MemexQA: given a collection of photos or videos from a user, the goal is to automatically answer questions that help users recover their memory about events captured in the collection. Expand
  • 18
  • 2
  • PDF
The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction
TLDR
This paper studies the problem of predicting the distribution over multiple possible future paths of people as they move through various visual scenes. Expand
  • 20
  • 2
  • PDF
Minding the Gaps in a Video Action Analysis Pipeline
TLDR
We present an event detection system, which shares many similarities with standard object detection pipelines. Expand
  • 9
  • 2
Peeking Into the Future: Predicting Future Person Activities and Locations in Videos
TLDR
We propose an end-to-end, multi-task learning system utilizing rich visual features about human behavioral information and interaction with their surroundings to predict a pedestrian's future path jointly with future activities. Expand
  • 8
  • 2
Learning to Detect Concepts from Webly-Labeled Video Data
Learning detectors that can recognize concepts, such as people actions, objects, etc., in video content is an interesting but challenging problem. In this paper, we study the problem of automaticallyExpand
  • 44
  • 1
  • PDF
Focal Visual-Text Attention for Memex Question Answering
TLDR
This paper proposes a new multimodal MemexQA task: given a sequence of photos from a user, the goal is to automatically answer questions that help users recover their memory about an event captured in these photos. Expand
  • 25
  • 1
  • PDF
MSNet: A Multilevel Instance Segmentation Network for Natural Disaster Damage Assessment in Aerial Videos
TLDR
In this paper, we study the problem of efficiently assessing building damage after natural disasters like hurricanes, floods or fires, through aerial video analysis. Expand
  • 3
  • 1
  • PDF
Video Description Generation using Audio and Visual Cues
TLDR
The recent advances in image captioning stimulate the research in generating natural language description for visual content, which can be widely applied in many applications. Expand
  • 25
  • PDF