Learn More
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we utilize deep architectures to learn discriminative(More)
Bag of Visual Words model (BoVW) with local features has become the most popular method in action recognition and obtained the state-of-the-art performance on several realistic datasets, such as the HMDB51, UCF50, and UCF101. BoVW yields a general pipeline to construct a global representation from a set of local features, which is mainly composed of five(More)
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet archi-tectures for action recognition in videos and learn these models given limited(More)
Representation of video is a vital problem in action recognition. This paper proposes Stacked Fisher Vectors (SFV), a new representation with multi-layer nested Fisher vector encoding, for action recognition. In the first layer, we densely sample large subvolumes from input videos, extract local features, and encode them using Fisher vectors (FVs). The(More)
Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement of deep convo-lutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First the current network architectures (e.g. Two-stream ConvNets [12]) are(More)
The initial desertification in the Asian interior is thought to be one of the most prominent climate changes in the Northern Hemisphere during the Cenozoic era. But the dating of this transition is uncertain, partly because desert sediments are usually scattered, discontinuous and difficult to date. Here we report nearly continuous aeolian deposits covering(More)
Designing effective features is a fundamental problem in computer vision. However, it is usually difficult to achieve a great tradeoff between discriminative power and robustness. Previous works shown that spatial co-occurrence can boost the discriminative power of features. However the current existing co-occurrence features are taking few considerations(More)
Images and videos are often characterized by multiple types of local descriptors such as SIFT, HOG and HOF, each of which describes certain aspects of object feature. Recognition systems benefit from fusing multiple types of these descriptors. Two widely applied fusion pipelines are descriptor concatenation and kernel average. The first one is effective(More)
Phoneme segmentation is a fundamental problem in many speech recognition and synthesis studies. Unsupervised phoneme segmentation assumes no knowledge on linguistic contents and acoustic models, and thus poses a challenging problem. The essential question here is what is the optimal segmentation. This paper formulates the optimal segmentation problem into a(More)