Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data

@article{Rohrbach2015RecognizingFA,
  title={Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data},
  author={Marcus Rohrbach and Anna Rohrbach and Michaela Regneri and Sikandar Amin and Mykhaylo Andriluka and Manfred Pinkal and Bernt Schiele},
  journal={International Journal of Computer Vision},
  year={2015},
  volume={119},
  pages={346--373}
}
Activity recognition has shown impressive progress in recent years. However, the challenges of detecting fine-grained activities and understanding how they are combined into composite activities have been largely overlooked. In this work we approach both tasks and present a dataset which provides detailed annotations to address them. The first challenge is to detect fine-grained activities, which are defined by low inter-class variability and are typically characterized by fine-grained body…
Multi-stream I3D Network for Fine-grained Action Recognition
  • Jian You, Ping Shi, Xiaojie Bao
  • Computer Science
  • 2018 IEEE 4th Information Technology and Mechatronics Engineering Conference (ITOEC)
  • 2018
TLDR
This method uses the I3D network, which has achieved great success in the area of coarse-grained action recognition, as the basic network architecture, and extracts the human pose and hand to obtain local features of the fine-grained action.
Few-shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning
TLDR
The few-shot fine-grained action recognition problem is proposed, aiming to recognize novel fine-grained actions with only a few samples given for each class, and contrastive meta-learning (CML) is introduced, which generates more discriminative video representations for low inter-class variance data.
Actionness-pooled Deep-convolutional Descriptor for fine-grained action recognition
TLDR
The visual attention mechanism is introduced into the proposed descriptor, termed Actionness-pooled Deep-convolutional Descriptor (ADD): instead of pooling features uniformly from the entire video, it aggregates features in sub-regions that are more likely to contain actions according to actionness maps.
FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding
TLDR
FineGym is a new dataset built on top of gymnastics videos that provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy; the paper systematically investigates different methods on this dataset and obtains a number of interesting findings.
TVENet: Temporal variance embedding network for fine-grained action representation
TLDR
This paper constructs a fine-grained action dataset, i.e., Figure Skating, which can be used for end-to-end network training, presents a framework for the joint optimization of classification and similarity constraints, and proposes a temporal variance embedding network (TVENet) that embeds temporal context variances into the feature embeddings during the joint network training.
FineAction: A Fine-Grained Video Dataset for Temporal Action Localization
TLDR
A novel large-scale and fine-grained video dataset, coined as FineAction, that introduces new opportunities and challenges for temporal action localization, thanks to its distinct characteristics of fine action classes with rich diversity, dense annotations of multiple instances, and co-occurring actions of different classes.
Segmental Spatiotemporal CNNs for Fine-Grained Action Segmentation
TLDR
This work proposes a model for action segmentation which combines low-level spatiotemporal features with a high-level segmental classifier and introduces an efficient constrained segmental inference algorithm for this model that is orders of magnitude faster than the current approach.
Extracting Action Hierarchies from Action Labels and their Use in Deep Action Recognition
TLDR
The exploitation of this hierarchical organization of action classes in different levels of granularity improves the learning speed and overall performance of a range of baseline and mid-range deep architectures for human action recognition (HAR).
Segmental Spatio-Temporal CNNs for Fine-grained Action Segmentation and Classification
TLDR
A new spatio-temporal CNN model for fine-grained action classification and segmentation is proposed, which combines a spatial CNN that represents objects in the scene and their spatial relationships, a temporal CNN that captures how object relationships within an action change over time, and a semi-Markov model that captures transitions from one action to another.
Follow the Attention: Combining Partial Pose and Object Motion for Fine-Grained Action Detection
TLDR
This work introduces a framework for integrating human pose and object motion to both temporally detect and classify activities in a fine-grained manner, and empirically shows the capability of this approach by achieving state-of-the-art results on the MERL shopping dataset.

References

SHOWING 1-10 OF 138 REFERENCES
Script Data for Attribute-Based Recognition of Composite Activities
TLDR
This paper leverages the fact that many human activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts to incorporate script data that delivers new variations of a composite activity or even extends to unseen composite activities.
A database for fine grained activity detection of cooking activities
TLDR
A novel database of 65 cooking activities, continuously recorded in a realistic setting, is proposed, suggesting that fine-grained activities are more difficult to detect and the body model can help in those cases.
YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition
TLDR
This paper presents a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object, and uses a Web-scale language model to "fill in" novel verbs.
Towards Understanding Action Recognition
TLDR
It is found that high-level pose features greatly outperform low/mid-level features; in particular, pose over time is critical, but current pose estimation algorithms are not yet reliable enough to provide this information.
Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities
  • M. Ryoo, J. Aggarwal
  • Computer Science
  • 2009 IEEE 12th International Conference on Computer Vision
  • 2009
TLDR
A novel matching, spatio-temporal relationship match, which is designed to measure structural similarity between sets of features extracted from two videos, thereby enabling detection and localization of complex non-periodic activities.
A combined pose, object, and feature model for action understanding
TLDR
This work presents a system that is able to recognize complex, fine-grained human actions involving the manipulation of objects in realistic action sequences by combining these elements in a single model that outperforms existing state of the art techniques on this dataset.
The Action Similarity Labeling Challenge
TLDR
This paper presents a novel video database, the “Action Similarity LAbeliNg” (ASLAN) database, along with benchmark protocols, and makes the ASLAN database, benchmarks, and descriptor encodings publicly available to the research community.
Poselet Key-Framing: A Model for Human Activity Recognition
TLDR
A new model for recognizing human actions that supports spatio-temporal localization and is insensitive to dropped frames or partial observations is developed; it shows classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset.
A selective spatio-temporal interest point detector for human action recognition in complex scenes
TLDR
This paper presents a new approach for STIP detection by applying surround suppression combined with local and temporal constraints, and introduces a novel vocabulary building strategy by combining spatial pyramid and vocabulary compression techniques, resulting in improved performance and efficiency.
Video Event Understanding Using Natural Language Descriptions
TLDR
A topic-based semantic relatedness measure is introduced between a video description and an action and role label, and incorporated into a posterior regularization objective that matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.