Publications
TSM: Temporal Shift Module for Efficient Video Understanding
TLDR
A generic and effective Temporal Shift Module (TSM) that achieves the performance of 3D CNNs while maintaining the complexity of 2D CNNs, and is extended to an online setting that enables real-time, low-latency video recognition and video object detection.
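The core idea behind TSM is a shift of a fraction of channels along the time axis, so that a subsequent 2D convolution mixes information across neighboring frames at no extra FLOPs. A minimal NumPy sketch of that operation (an illustration under my own assumptions about the fold ratio, not the authors' implementation):

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """x: activations of shape (T, C, H, W).

    Shift 1/fold_div of the channels one step backward in time and
    another 1/fold_div one step forward, zero-padding the boundary
    frames; the remaining channels are left untouched.
    """
    t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # shift: future -> present
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # shift: past -> present
    out[:, 2 * fold:] = x[:, 2 * fold:]              # unshifted channels
    return out

# Toy input: 2 frames, 8 channels, 1x1 spatial
x = np.arange(2 * 8, dtype=float).reshape(2, 8, 1, 1)
shifted = temporal_shift(x)
```

In the online variant described above, the backward shift would be dropped (no access to future frames) and the shifted channels cached between frames.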
Once for All: Train One Network and Specialize it for Efficient Deployment
TLDR
This work trains a once-for-all (OFA) network that supports diverse architectural settings by decoupling training from search, reducing deployment cost, and proposes a novel progressive shrinking algorithm, a generalized pruning method that shrinks the model across many more dimensions than conventional pruning.
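The decoupling summarized above means deployment becomes a search over sub-network configurations of an already-trained supernet, with no retraining. A hedged sketch of that search step (the dimension choices and the cost model are illustrative assumptions, not the OFA codebase):

```python
import itertools

# Example elastic dimensions a supernet might expose per stage
depths = [2, 3, 4]
widths = [3, 4, 6]
kernels = [3, 5, 7]

def latency(cfg):
    # Stand-in cost model; a real deployment would measure (or predict)
    # latency on the target device.
    d, w, k = cfg
    return d * w * k

def search(budget):
    """Pick the largest sub-network whose cost fits the latency budget."""
    feasible = [c for c in itertools.product(depths, widths, kernels)
                if latency(c) <= budget]
    return max(feasible, key=latency) if feasible else None

cfg = search(budget=60)
```

Because training and search are decoupled, the same supernet serves many hardware targets: only `search` is re-run per device, which is the cost reduction the summary refers to.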
The Sound of Pixels
TLDR
Qualitative results suggest the PixelPlayer model learns to ground sounds in vision, enabling applications such as independently adjusting the volume of sound sources, and experimental results show that the proposed Mix-and-Separate framework outperforms several baselines on source separation.
Semantic Compositional Networks for Visual Captioning
  • Zhe Gan, Chuang Gan, L. Deng
  • Computer Science
    IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 23 November 2016
TLDR
Experimental results show that the proposed method significantly outperforms prior state-of-the-art approaches, across multiple evaluation metrics.
The Neuro-Symbolic Concept Learner: Interpreting Scenes Words and Sentences from Natural Supervision
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual concepts, words, and semantic parsing of sentences without explicit supervision on any of them; instead, our model learns by simply looking at images and reading paired questions and answers.
Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding
TLDR
This work proposes a neural-symbolic visual question answering system that first recovers a structural scene representation from the image and a program trace from the question, then executes the program on the scene representation to obtain an answer.
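The pipeline summarized above factors VQA into a structural scene representation and a symbolic program executed over it. A minimal sketch of that execution step (the attribute schema and operation names such as `filter_color` and `count` are illustrative assumptions, not the paper's exact DSL):

```python
# A structural scene representation recovered from an image:
# one record per detected object.
scene = [
    {"shape": "cube",   "color": "red"},
    {"shape": "sphere", "color": "red"},
    {"shape": "cube",   "color": "blue"},
]

# A program trace parsed from a question such as
# "How many red objects are there?"
program = [("filter_color", "red"), ("count", None)]

def execute(program, scene):
    """Run each symbolic operation over the current state in sequence."""
    state = scene
    for op, arg in program:
        if op == "filter_color":
            state = [obj for obj in state if obj["color"] == arg]
        elif op == "count":
            state = len(state)
    return state

answer = execute(program, scene)  # -> 2
```

Keeping reasoning symbolic like this is what makes the intermediate steps inspectable: the filtered object set and the final count can both be read off directly.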
Temporal Shift Module for Efficient Video Understanding
TLDR
A generic and effective Temporal Shift Module (TSM) that achieves the performance of 3D CNNs while maintaining the complexity of 2D CNNs, and ranked first on both the Something-Something V1 and V2 leaderboards upon this paper's submission.
CLEVRER: CoLlision Events for Video REpresentation and Reasoning
TLDR
This work introduces CoLlision Events for Video REpresentation and Reasoning (CLEVRER), a diagnostic video dataset for systematic evaluation of computational models on a wide range of reasoning tasks, and evaluates various state-of-the-art visual reasoning models on this benchmark.
Graph Convolutional Networks for Temporal Action Localization
TLDR
This paper builds an action proposal graph, where each proposal is represented as a node and the relation between two proposals as an edge, and applies GCNs over the graph to model the relations among different proposals and learn powerful representations for action classification and localization.
Dense Regression Network for Video Grounding
TLDR
A novel dense regression network (DRN) that regresses the distances from each frame within the ground truth to the starting (ending) frame of the video segment described by the query, improving video grounding accuracy.
...