Peeking Into the Future: Predicting Future Person Activities and Locations in Videos
TLDR
An end-to-end, multi-task learning system utilizing rich visual features about human behavioral information and interaction with their surroundings is proposed, providing the first empirical evidence that joint modeling of paths and activities benefits future path prediction.
The Garden of Forking Paths: Towards Multi-Future Trajectory Prediction
TLDR
A new model, referred to as Multiverse, is introduced to generate multiple plausible future trajectories; its novel designs use multi-scale location encodings and convolutional RNNs over graphs.
Focal Visual-Text Attention for Visual Question Answering
TLDR
A novel neural network called Focal Visual-Text Attention network (FVTA) is described for collective reasoning in visual question answering, where both visual and text sequence information such as images and text metadata are presented.
MemexQA: Visual Memex Question Answering
TLDR
Experimental results on the MemexQA dataset demonstrate that MemexNet outperforms strong baselines and achieves state-of-the-art results on this novel and challenging task, and they suggest MemexNet's efficacy and scalability across various QA tasks.
Learning to Detect Concepts from Webly-Labeled Video Data
TLDR
This paper presents compelling insights on the latent non-convex robust loss that is being minimized on the noisy data and proposes two novel techniques that not only enable WELL to be applied to big data but also lead to more accurate results.
Minding the Gaps in a Video Action Analysis Pipeline
TLDR
An event detection system composed of four modules: feature extraction, event proposal generation, event classification and event localization is presented, which shares many similarities with standard object detection pipelines.
Focal Visual-Text Attention for Memex Question Answering
TLDR
The MemexQA dataset, the first publicly available multimodal question answering dataset consisting of real personal photo albums, is presented, along with an end-to-end trainable network that uses a hierarchical process to dynamically determine what media and what time to focus on in the sequential data to answer the question.
MSNet: A Multilevel Instance Segmentation Network for Natural Disaster Damage Assessment in Aerial Videos
TLDR
A new model, MSNet, is presented; it contains novel region proposal network designs and an unsupervised score refinement network for confidence score calibration in both the bounding box and mask branches, and it achieves state-of-the-art results compared to previous methods on this dataset.
SimAug: Learning Robust Representations from Simulation for Trajectory Prediction
TLDR
A novel approach is proposed to learn robust representations by augmenting the simulation training data so that the representations generalize better to unseen real-world test data.