Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we utilize deep architectures to learn discriminative …
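As a rough illustration of the trajectory-pooling idea described above, the sketch below samples convolutional feature maps at pre-computed trajectory coordinates and averages them into one descriptor per trajectory. The function name, array shapes, and the plain average pooling are illustrative assumptions, not necessarily the paper's exact procedure.

```python
import numpy as np

def trajectory_pooled_descriptor(feature_maps, trajectories, scale):
    """Hypothetical sketch: pool convolutional feature maps along trajectories.

    feature_maps : array of shape (T, H, W, C), one spatial map per frame
    trajectories : list of arrays of shape (L, 3) with (frame, x, y) points
                   given in original-video coordinates
    scale        : ratio between video resolution and feature-map resolution
    """
    T, H, W, C = feature_maps.shape
    descriptors = []
    for traj in trajectories:
        pooled = np.zeros(C)
        for t, x, y in traj:
            # map the trajectory point onto the (coarser) feature-map grid
            fx = min(int(round(x / scale)), W - 1)
            fy = min(int(round(y / scale)), H - 1)
            pooled += feature_maps[int(t), fy, fx, :]
        descriptors.append(pooled / len(traj))  # average pooling along the track
    return np.stack(descriptors)
```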
The Bag of Visual Words (BoVW) model with local features has become the most popular approach in action recognition and has obtained state-of-the-art performance on several realistic datasets, such as HMDB51, UCF50, and UCF101. BoVW yields a general pipeline to construct a global representation from a set of local features, which is mainly composed of five …
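For readers unfamiliar with the BoVW pipeline, here is a minimal sketch of one common instantiation: a k-means codebook, hard-assignment encoding, sum pooling, and L2 normalization. The codebook size, encoding choice, and the random placeholder descriptors are illustrative assumptions; the paper studies many variants of these modules.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(local_features, k=256, seed=0):
    """Learn a visual codebook with k-means (one common choice)."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(local_features)

def encode_video(local_features, codebook):
    """Hard-assignment encoding + sum pooling + L2 normalization."""
    words = codebook.predict(local_features)  # nearest visual word per descriptor
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)  # global video representation

# Usage sketch: in practice the descriptors would come from local video features
# such as dense trajectories; random arrays stand in for them here.
train_descriptors = np.random.rand(10000, 96)
codebook = build_codebook(train_descriptors, k=64)
video_repr = encode_video(np.random.rand(500, 96), codebook)
```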
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited …
Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state-of-the-art. In most of the available CNNs, the softmax loss function is used as the supervision signal to train the deep model. In order to enhance the discriminative power of the deeply learned features, this paper proposes a new …
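The softmax supervision mentioned above, together with one hypothetical way of adding a discriminative penalty on top of it, can be sketched as follows. The auxiliary intra-class compactness term and its weight are illustrative assumptions, not the loss actually proposed in the (truncated) abstract.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Standard softmax loss used as the supervision signal."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def intra_class_compactness(features, labels, centers):
    """Hypothetical auxiliary term: pull deep features toward their class centers.
    This only illustrates how a discriminative penalty can be added on top of
    softmax; it is not presented here as the paper's formulation."""
    return 0.5 * ((features - centers[labels]) ** 2).sum(axis=1).mean()

def total_loss(logits, features, labels, centers, lambda_aux=0.01):
    # lambda_aux is a hypothetical trade-off weight
    return softmax_cross_entropy(logits, labels) + \
        lambda_aux * intra_class_compactness(features, labels, centers)
```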
Representation of video is a vital problem in action recognition. This paper proposes Stacked Fisher Vectors (SFV), a new representation with multi-layer nested Fisher vector encoding, for action recognition. In the first layer, we densely sample large subvolumes from input videos, extract local features, and encode them using Fisher vectors (FVs). The …
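A simplified sketch of the first-layer Fisher vector encoding step is shown below (gradients with respect to the GMM means only, followed by power and L2 normalization). The GMM size and the random placeholder descriptors are assumptions, and the stacking into a second layer is omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Simplified first-order Fisher vector (mean gradients only)."""
    N, D = descriptors.shape
    gamma = gmm.predict_proba(descriptors)  # (N, K) soft assignments
    mu, sigma, w = gmm.means_, np.sqrt(gmm.covariances_), gmm.weights_
    fv = []
    for k in range(gmm.n_components):
        diff = (descriptors - mu[k]) / sigma[k]  # whitened residuals
        fv.append((gamma[:, k, None] * diff).sum(axis=0) / (N * np.sqrt(w[k])))
    fv = np.concatenate(fv)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))  # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)  # L2 normalization

# Usage sketch: fit the GMM on local features pooled from training subvolumes,
# then encode each subvolume separately.
gmm = GaussianMixture(n_components=16, covariance_type='diag', random_state=0)
gmm.fit(np.random.rand(5000, 64))
fv = fisher_vector(np.random.rand(300, 64), gmm)  # one subvolume's FV
```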
Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement of deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First, the current network architectures (e.g. Two-stream ConvNets [12]) are …
Designing effective features is a fundamental problem in computer vision. However, it is usually difficult to achieve a good tradeoff between discriminative power and robustness. Previous works have shown that spatial co-occurrence can boost the discriminative power of features. However, existing co-occurrence features give little consideration to …
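As an illustration of what a spatial co-occurrence feature can look like, the sketch below counts how often pairs of quantized local features fall within a small neighborhood on a spatial grid. The grid layout, neighborhood radius, and normalization are assumptions for illustration, not the descriptor proposed in the paper.

```python
import numpy as np

def cooccurrence_histogram(labels_map, n_words, radius=1):
    """Illustrative spatial co-occurrence feature over a grid of visual words.

    labels_map : 2-D array of visual-word indices on a spatial grid
    n_words    : size of the visual vocabulary
    """
    H, W = labels_map.shape
    cooc = np.zeros((n_words, n_words))
    for y in range(H):
        for x in range(W):
            a = labels_map[y, x]
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        cooc[a, labels_map[ny, nx]] += 1  # count the word pair
    return cooc.ravel() / (cooc.sum() + 1e-12)
```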
Face detection and alignment in unconstrained environments are challenging due to various poses, illuminations, and occlusions. Recent studies show that deep learning approaches can achieve impressive performance on these two tasks. In this letter, we propose a deep cascaded multitask framework that exploits the inherent correlation between detection and …
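One hypothetical way to couple detection and alignment in a single multitask objective is sketched below: face/non-face classification plus bounding-box and landmark regression terms. The specific losses and weights are illustrative assumptions rather than the framework's actual formulation.

```python
import numpy as np

def multitask_loss(cls_prob, cls_label, bbox_pred, bbox_target,
                   lmk_pred, lmk_target, w_box=0.5, w_lmk=0.5):
    """Hypothetical joint objective for face detection and alignment.

    cls_prob/cls_label   : predicted face probability and 0/1 label per sample
    bbox_pred/bbox_target: bounding-box regression offsets, shape (N, 4)
    lmk_pred/lmk_target  : facial-landmark coordinates, shape (N, 2*K)
    w_box, w_lmk         : illustrative task weights, not the paper's values
    """
    eps = 1e-12
    cls_loss = -np.mean(cls_label * np.log(cls_prob + eps)
                        + (1 - cls_label) * np.log(1 - cls_prob + eps))
    box_loss = np.mean(np.sum((bbox_pred - bbox_target) ** 2, axis=1))
    lmk_loss = np.mean(np.sum((lmk_pred - lmk_target) ** 2, axis=1))
    return cls_loss + w_box * box_loss + w_lmk * lmk_loss
```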
Maximally Stable Extremal Regions (MSERs) have achieved great success in scene text detection. However, this low-level pixel operation inherently limits its capability for handling complex text information efficiently (e.g., connections between text or background components), leading to the difficulty in distinguishing texts from background components. In …
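The low-level MSER step referred to above can be reproduced with OpenCV as in the sketch below; the subsequent grouping of candidate regions into text and rejection of background components, which is the focus of the paper, is not shown.

```python
import cv2

def extract_mser_candidates(image_path):
    """Minimal sketch of MSER-based candidate extraction using OpenCV."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    mser = cv2.MSER_create()
    regions, bboxes = mser.detectRegions(gray)  # candidate character regions
    return bboxes  # one (x, y, w, h) box per candidate region
```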
This paper proposes the motionlet, a mid-level spatiotemporal part, for human motion recognition. A motionlet can be seen as a tight cluster in motion and appearance space, corresponding to the moving process of different body parts. We postulate three key properties of motionlets for action recognition: high motion saliency, multiple scale representation, …
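A very rough sketch of mining tight clusters in a joint motion and appearance space is given below; the feature choice, cluster count, and the use of plain k-means are assumptions, and the saliency and multi-scale criteria mentioned above are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def mine_motionlet_candidates(patch_motion_feats, patch_appearance_feats, n_clusters=50):
    """Illustrative sketch: cluster spatiotemporal patches in a joint
    motion + appearance space; tight clusters play the role of mid-level parts."""
    joint = np.hstack([patch_motion_feats, patch_appearance_feats])
    km = KMeans(n_clusters=n_clusters, random_state=0, n_init=10).fit(joint)
    return km.labels_, km.cluster_centers_
```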