MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
TLDR
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
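A minimal PyTorch sketch of the soft-attention pooling step such a hybrid captioner relies on: at each decoding step, per-frame (or motion-clip) features are pooled with weights conditioned on the decoder state. Dimensions, layer names, and the single scoring MLP are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionPool(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):
        # feats: (B, T, feat_dim) per-frame or per-clip motion features
        # h:     (B, hidden_dim) current decoder hidden state
        e = self.score(torch.tanh(self.feat_proj(feats) + self.state_proj(h).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)           # (B, T, 1) attention weights over time
        return (alpha * feats).sum(dim=1)     # (B, feat_dim) pooled context vector
```

The pooled context would then be concatenated with the word embedding at each decoding step of the caption LSTM.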
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks
TLDR
This paper devises multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters on the spatial domain (equivalent to 2D CNN) plus 3 × 1 × 1 convolutions to construct temporal connections on adjacent feature maps in time.
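A minimal PyTorch sketch of one such block (roughly the spatial-then-temporal ordering), assuming a plain bottleneck without batch normalization or striding; channel sizes are placeholders.

```python
import torch
import torch.nn as nn

class P3DBlock(nn.Module):
    def __init__(self, in_ch, mid_ch):
        super().__init__()
        self.reduce = nn.Conv3d(in_ch, mid_ch, kernel_size=1)
        self.spatial = nn.Conv3d(mid_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(mid_ch, mid_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.expand = nn.Conv3d(mid_ch, in_ch, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: (B, C, T, H, W)
        out = self.relu(self.reduce(x))
        out = self.relu(self.spatial(out))     # 1x3x3: 2D-CNN-like spatial filtering
        out = self.relu(self.temporal(out))    # 3x1x1: connections across adjacent frames
        out = self.expand(out)
        return self.relu(out + x)              # residual connection
```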
Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition
TLDR
A novel recurrent attention convolutional neural network (RA-CNN) which recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way and achieves the best performance in three fine-grained tasks.
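A minimal sketch of the crop-and-zoom step between scales, written here with a spatial-transformer-style affine grid rather than the paper's sigmoid boxcar mask; the normalized [-1, 1] coordinate convention and the output size are assumptions.

```python
import torch
import torch.nn.functional as F

def crop_and_zoom(img, tx, ty, tl, out_size=224):
    # img: (B, C, H, W); tx, ty: attended-region centre, tl: half side length,
    # all given as (B,) tensors in normalised [-1, 1] image coordinates.
    B, C, _, _ = img.shape
    theta = torch.zeros(B, 2, 3, device=img.device)
    theta[:, 0, 0] = tl            # horizontal scale of the sampling grid
    theta[:, 1, 1] = tl            # vertical scale
    theta[:, 0, 2] = tx            # horizontal offset (region centre)
    theta[:, 1, 2] = ty            # vertical offset
    grid = F.affine_grid(theta, (B, C, out_size, out_size), align_corners=False)
    # Bilinear sampling zooms the attended square up to the finer scale's input size.
    return F.grid_sample(img, grid, align_corners=False)
```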
A Deep Learning-Based Approach to Progressive Vehicle Re-identification for Urban Surveillance
TLDR
This paper proposes a novel deep learning-based approach to PROgressive Vehicle re-IDentification, called "PROVID", which treats vehicle re-ID as two specific progressive search processes: coarse-to-fine search in the feature space, and near-to-distant search in the real-world surveillance environment.
Learning Multi-attention Convolutional Neural Network for Fine-Grained Image Recognition
TLDR
This paper proposes a novel part learning approach based on a multi-attention convolutional neural network (MA-CNN), in which part generation and feature learning can reinforce each other, and shows the best performance on three challenging published fine-grained datasets.
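A minimal sketch of the channel-grouping idea behind the part attentions: each part's attention map is a soft combination of feature channels, and part features come from attention-weighted spatial pooling. The plain learned assignment matrix here is an assumption; MA-CNN additionally uses clustering-style losses to keep the channel groups compact and diverse.

```python
import torch
import torch.nn as nn

class ChannelGroupAttention(nn.Module):
    def __init__(self, num_channels, num_parts):
        super().__init__()
        # Soft assignment of feature channels to parts (learned jointly in the real model).
        self.assign = nn.Parameter(torch.randn(num_parts, num_channels))

    def forward(self, fmap):
        # fmap: (B, C, H, W) convolutional feature map
        B, C, H, W = fmap.shape
        weights = torch.softmax(self.assign, dim=1)          # (P, C) channel groups
        # Part attention maps: weighted sum of channels per part.
        attn = torch.sigmoid(torch.einsum('pc,bchw->bphw', weights, fmap))
        # Part features: attention-weighted spatial pooling of the feature map.
        parts = torch.einsum('bphw,bchw->bpc', attn, fmap) / (H * W)
        return attn, parts                                   # (B, P, H, W) maps, (B, P, C) features
```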
Exploring Visual Relationship for Image Captioning
TLDR
This paper introduces a new design to explore the connections between objects for image captioning under the umbrella of an attention-based encoder-decoder framework, which novelly integrates both semantic and spatial object relationships into the image encoder.
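A minimal sketch of encoding object relationships with a graph convolution over detected-region features, whose outputs would then feed an attention-based caption decoder. The per-relationship adjacency format, dimensions, and normalization are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGCNLayer(nn.Module):
    def __init__(self, dim, num_edge_types):
        super().__init__()
        self.self_fc = nn.Linear(dim, dim)
        # One transform per relationship type, e.g. semantic vs. spatial edges.
        self.rel_fc = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_edge_types))

    def forward(self, regions, adj):
        # regions: (B, N, dim) detected-region features
        # adj:     (B, R, N, N) one 0/1 adjacency matrix per relationship type
        out = self.self_fc(regions)
        for r, fc in enumerate(self.rel_fc):
            msg = torch.bmm(adj[:, r], fc(regions))           # aggregate neighbour messages
            deg = adj[:, r].sum(dim=-1, keepdim=True).clamp(min=1)
            out = out + msg / deg                             # degree-normalised aggregation
        return F.relu(out)                                    # relation-aware region features
```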
Jointly Modeling Embedding and Translation to Bridge Video and Language
TLDR
A novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which can simultaneously explore the learning of LSTM and visual-semantic embedding and outperforms several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.
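A minimal sketch of the kind of joint objective this suggests: a captioning (coherence) term from the LSTM plus a relevance term that pulls the video and sentence vectors together in the shared embedding space. The specific distance, the projection layers producing the embeddings, and the weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_embedding_caption_loss(video_emb, sent_emb, word_logits, word_targets, lam=0.5):
    # video_emb, sent_emb: (B, D) video and sentence vectors in the shared space
    # word_logits: (B, L, V) per-step decoder outputs; word_targets: (B, L) word indices
    relevance = F.mse_loss(video_emb, sent_emb)               # embedding (relevance) term
    coherence = F.cross_entropy(word_logits.flatten(0, 1),    # captioning (coherence) term
                                word_targets.flatten())
    return lam * relevance + (1.0 - lam) * coherence
```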
Boosting Image Captioning with Attributes
TLDR
This paper presents Long Short-Term Memory with Attributes (LSTM-A), a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural Networks (RNNs) image captioning framework by training them in an end-to-end manner.
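A minimal sketch of one way to inject attributes into a CNN-plus-RNN captioner in this spirit: the attribute probability vector and the image feature are projected and fed to the LSTM before the caption words. Which LSTM-A variant this corresponds to, and all dimensions, are assumptions.

```python
import torch
import torch.nn as nn

class AttributeCaptioner(nn.Module):
    def __init__(self, vocab, attr_dim, img_dim, emb_dim=512, hid_dim=512):
        super().__init__()
        self.attr_proj = nn.Linear(attr_dim, emb_dim)   # attribute probabilities -> input space
        self.img_proj = nn.Linear(img_dim, emb_dim)     # CNN image feature -> input space
        self.embed = nn.Embedding(vocab, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab)

    def forward(self, attrs, img_feat, captions):
        # attrs: (B, attr_dim), img_feat: (B, img_dim), captions: (B, L) word ids
        prefix = torch.stack([self.attr_proj(attrs), self.img_proj(img_feat)], dim=1)
        words = self.embed(captions)
        hidden, _ = self.lstm(torch.cat([prefix, words], dim=1))
        # Logits for each caption word, predicted from the step just before it is fed.
        return self.out(hidden[:, prefix.size(1) - 1:-1])
```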
Multiview Spectral Embedding
TLDR
A new spectral-embedding algorithm, namely multiview spectral embedding (MSE), which can encode different features in different ways to achieve a physically meaningful embedding and explores the complementary properties of different views.
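A worked sketch of the underlying idea with fixed view weights: each view contributes a normalized graph Laplacian, the Laplacians are combined with nonnegative weights, and the embedding is read off the smallest nontrivial eigenvectors. The actual MSE algorithm alternates between updating the embedding and the view weights; the Gaussian kernel, its bandwidth, and the uniform weights here are assumptions.

```python
import numpy as np

def view_laplacian(X, sigma=1.0):
    # X: (n, d) features of one view -> normalised graph Laplacian (n, n)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(1)))
    return np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

def multiview_embedding(views, dim=2, alphas=None):
    # views: list of (n, d_v) arrays; alphas: nonnegative view weights summing to 1
    if alphas is None:
        alphas = np.ones(len(views)) / len(views)
    L = sum(a * view_laplacian(X) for a, X in zip(alphas, views))
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:dim + 1]        # skip the trivial (near-constant) eigenvector
```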
PROVID: Progressive and Multimodal Vehicle Reidentification for Large-Scale Urban Surveillance
TLDR
This paper proposes PROVID, a PROgressive Vehicle re-IDentification framework based on deep neural networks, which not only utilizes the multimodal data in large-scale video surveillance, such as visual features, license plates, camera locations, and contextual information, but also considers vehicle re-identification as two progressive procedures: coarse-to-fine search in the feature domain, and near-to-distant search in the physical space.
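A minimal sketch of the progressive search loop this describes: appearance features produce a coarse candidate set, a finer cue (license-plate features here) re-scores it, and a spatiotemporal term implements the near-to-distant preference. All scoring functions, field names, weights, and units (assumed pre-normalized) are assumptions.

```python
import numpy as np

def progressive_search(query, gallery, top_k=100, w_plate=0.6, w_st=0.4):
    # query / gallery items: dicts with 'app' (appearance vector), 'plate' (plate
    # feature vector), 'cam_xy' (camera location), 't' (timestamp), all pre-normalised.
    app_dist = [np.linalg.norm(query['app'] - g['app']) for g in gallery]
    coarse = np.argsort(app_dist)[:top_k]                 # coarse-to-fine: appearance filter
    scores = []
    for i in coarse:
        g = gallery[i]
        plate = np.linalg.norm(query['plate'] - g['plate'])
        st = np.linalg.norm(query['cam_xy'] - g['cam_xy']) + abs(query['t'] - g['t'])
        scores.append(w_plate * plate + w_st * st)        # near-to-distant re-ranking
    return coarse[np.argsort(scores)]                     # gallery indices, best match first
```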