Leveraging Video Descriptions to Learn Video Question Answering
TLDR
A scalable approach to learning video-based question answering (QA), i.e., answering a free-form natural language question about video content, is proposed, together with a self-paced learning procedure that iteratively identifies imperfect candidate QA pairs, and both are shown to be effective.
Omnidirectional CNN for Visual Place Recognition and Navigation
TLDR
A novel Omnidirectional Convolutional Neural Network (O-CNN) is proposed to handle severe camera pose variation, achieving state-of-the-art accuracy and speed on both virtual-world and real-world datasets.
Title Generation for User Generated Videos
TLDR
This work addresses video title generation for the first time by proposing two methods that extend state-of-the-art video captioners to this new task, and also proposes a novel sentence augmentation method to train a captioner with additional sentence-only examples that come without corresponding videos.
Agent-Centric Risk Assessment: Accident Anticipation and Risky Region Localization
TLDR
A novel soft-attention Recurrent Neural Network (RNN) is proposed that explicitly models both the spatial and appearance-wise non-linear interactions between the agent triggering the event and any other agent or static region involved.
Visual Forecasting by Imitating Dynamics in Natural Sequences
TLDR
A general framework for visual forecasting that requires no additional supervision: by formulating visual forecasting as an inverse reinforcement learning (IRL) problem, it directly imitates the dynamics in natural sequences from their raw pixel values.
AllenAct: A Framework for Embodied AI Research
TLDR
AllenAct is introduced, a modular and flexible learning framework designed with a focus on the unique requirements of Embodied AI research that provides first-class support for a growing collection of embodied environments, tasks and algorithms.
Style Example-Guided Text Generation using Generative Adversarial Transformers
TLDR
This work introduces a language generative model framework for generating a styled paragraph based on a context sentence and a style reference example and proposes a novel objective function to train the framework.
Semantic Highlight Retrieval and Term Prediction
TLDR
This work focuses on user-generated viral videos, which typically contain a short highlight marked by users, and proposes a query-dependent video representation for retrieving a variety of highlights; the method outperforms all baselines on the publicly available video highlight dataset.
Self-view Grounding Given a Narrated 360° Video
TLDR
A novel Visual Grounding Model (VGM) is proposed to implicitly and efficiently predict the normal fields of view (NFoVs) of a 360° video given the video content and the subtitles of its narrative, achieving state-of-the-art NFoV-grounding performance.
Visual Reaction: Learning to Play Catch With Your Drone
TLDR
The results show that the model that integrates a forecaster with a planner outperforms a set of strong baselines that are based on tracking as well as pure model-based and model-free RL baselines.