An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition

@article{Kim2022AnAI,
  title={An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition},
  author={Kiyoon Kim and Davide Moltisanti and Oisin Mac Aodha and Laura Sevilla-Lara},
  journal={ArXiv},
  year={2022},
  volume={abs/2210.04933}
}
Precisely naming the action depicted in a video can be a challenging and oftentimes ambiguous task. In contrast to object instances represented as nouns (e.g. dog, cat, chair, etc.), in the case of actions, human annotators typically lack a consensus as to what constitutes a specific action (e.g. jogging versus running). In practice, a given video can contain multiple valid positive annotations for the same action. As a result, video datasets often contain significant levels of label noise and… 
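To make the premise concrete, the sketch below shows one straightforward way to train against several valid verb annotations at once, using a multi-label binary cross-entropy target instead of a single one-hot class. This is an illustrative baseline, not the paper's method; the verb vocabulary and the clip feature are hypothetical placeholders.

import torch
import torch.nn as nn

# Hypothetical verb vocabulary; a real dataset would define its own.
VERBS = ["run", "jog", "sprint", "walk", "sit"]

# A clip annotated by several people may receive multiple valid verbs.
valid_verbs = ["run", "jog"]            # both annotations count as positives
target = torch.zeros(len(VERBS))
for v in valid_verbs:
    target[VERBS.index(v)] = 1.0

# Any video backbone producing a clip-level feature would do here.
features = torch.randn(1, 512)          # placeholder clip feature
classifier = nn.Linear(512, len(VERBS)) # one logit per verb

logits = classifier(features)
# Multi-label objective: each verb is an independent binary decision,
# so "run" and "jog" can both be positive for the same clip.
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, target.unsqueeze(0))
loss.backward()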


References

SHOWING 1-10 OF 47 REFERENCES

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos

This paper presents a spatio-temporal action recognition model trained with only video-level labels, which are significantly easier to annotate, and reports the first weakly-supervised results on the AVA dataset along with state-of-the-art results among weakly-supervised methods on UCF101-24.
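As a rough sketch of the underlying weak-supervision setup (a generic multiple-instance baseline, not the paper's uncertainty-aware model), per-frame scores can be pooled over time so that only video-level labels are needed for training:

import torch
import torch.nn as nn

num_classes = 24                         # e.g. UCF101-24 action classes
frame_scores = nn.Linear(256, num_classes)

# Placeholder per-frame features for one untrimmed video (T frames).
frames = torch.randn(100, 256)
logits = frame_scores(frames)            # (T, num_classes)

# Multiple-instance pooling: a video is positive for a class if at
# least one frame is, so a max over time gives video-level logits.
video_logits = logits.max(dim=0).values  # (num_classes,)

video_label = torch.zeros(num_classes)
video_label[3] = 1.0                     # only the video-level tag is known
loss = nn.functional.binary_cross_entropy_with_logits(
    video_logits, video_label)
loss.backward()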

Semi-Supervised Action Recognition with Temporal Contrastive Learning

This work proposes a two-pathway temporal contrastive model that uses unlabeled videos played at two different speeds, leveraging the fact that changing playback speed does not change the action. The model maximizes the similarity between encoded representations of the same video at two different speeds while minimizing the similarity between different videos played at different speeds.
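A minimal sketch of this idea, assuming an InfoNCE-style objective and placeholder embeddings; the paper's full two-pathway model involves more than this:

import torch
import torch.nn.functional as F

def speed_contrastive_loss(z_slow, z_fast, temperature=0.1):
    # Row i of z_slow should match row i of z_fast (same video at two
    # playback speeds); every other row in the batch acts as a negative.
    z_slow = F.normalize(z_slow, dim=1)
    z_fast = F.normalize(z_fast, dim=1)
    sim = z_slow @ z_fast.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(z_slow.size(0))   # matching index = same video
    return F.cross_entropy(sim, targets)

# Placeholder embeddings: batch of 8 videos encoded at 1x and 2x speed.
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = speed_contrastive_loss(z1, z2)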

Learning to Learn from Noisy Web Videos

This work proposes a reinforcement learning-based formulation for selecting the right examples for training a classifier from noisy web search results. It uses Q-learning to learn a data labeling policy on a small labeled training dataset, and then uses this policy to automatically label noisy web data for new visual concepts.
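A toy tabular sketch of the Q-learning ingredient; the states, buckets, and reward below are invented for illustration, whereas the paper learns a deep policy rewarded by classifier improvement on held-out data:

import random

# State: a coarse bucket (0-9) of the classifier's confidence on a
# candidate web example. Action: 0 = discard, 1 = keep as a positive.
ACTIONS = (0, 1)
Q = {(s, a): 0.0 for s in range(10) for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.2

def reward(state, action):
    # Stand-in reward: keeping confident examples helps, keeping noise hurts.
    return (state / 9.0 - 0.5) if action == 1 else 0.0

for episode in range(1000):
    state = random.randrange(10)
    # Epsilon-greedy action selection.
    if random.random() < eps:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    r = reward(state, action)
    next_state = random.randrange(10)
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    # One-step Q-learning update.
    Q[(state, action)] += alpha * (r + gamma * best_next - Q[(state, action)])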

Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition

The primary empirical finding is that pre-training at a very large scale (over 65 million videos), despite relying on noisy social-media videos and hashtags, substantially improves the state of the art on three challenging public action recognition datasets.

Learning Visual Actions Using Multiple Verb-Only Labels

It is demonstrated that multi-label verb-only representations outperform conventional single-verb labels, and that a multi-verb representation offers further benefits, including cross-dataset retrieval and retrieval of manner and result verb types.

Trespassing the Boundaries: Labeling Temporal Bounds for Object Interactions in Egocentric Video

It is demonstrated that annotator disagreement stems from a limited understanding of the distinct phases of an action, and annotating based on the Rubicon Boundaries, inspired by a similarly named cognitive model, is proposed to obtain consistent temporal bounds for object interactions.

Learning in an Uncertain World: Representing Ambiguity Through Multiple Hypotheses

This work proposes a framework for reformulating existing single-prediction models as multiple hypothesis prediction (MHP) models, together with an associated meta loss and optimization procedure to train them. MHP models are found to outperform their single-hypothesis counterparts in all cases and to expose valuable insights into the variability of predictions.
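The meta loss can be sketched as an epsilon-relaxed winner-takes-all objective: the best hypothesis receives most of the weight and the rest share a small remainder so every head keeps receiving gradient. A minimal version, assuming M regression heads and an L2 base loss:

import torch

def mhp_meta_loss(hypotheses, target, eps=0.05):
    # hypotheses: (M, D) predictions from M heads; target: (D,) ground truth.
    # The closest hypothesis gets weight (1 - eps); the others share eps.
    errors = ((hypotheses - target) ** 2).sum(dim=1)  # per-hypothesis L2
    M = hypotheses.size(0)
    weights = torch.full((M,), eps / (M - 1))
    weights[errors.argmin()] = 1.0 - eps
    return (weights * errors).sum()

# Three hypothetical heads predicting a 2-D quantity for one sample.
preds = torch.randn(3, 2, requires_grad=True)
loss = mhp_meta_loss(preds, torch.tensor([0.5, -0.2]))
loss.backward()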

Temporal Segment Networks for Action Recognition in Videos

The proposed temporal segment network (TSN) framework models long-range temporal structure with a new segment-based sampling and aggregation scheme, and won the video classification track of the ActivityNet challenge 2016 among 24 teams.
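A minimal sketch of the segment-based sampling and the average consensus, with a placeholder score tensor standing in for the real snippet network:

import torch

def sample_segments(num_frames, num_segments=3):
    # TSN-style sampling: split the video into equal segments and draw
    # one random frame index from each, so samples span the whole duration.
    seg_len = num_frames // num_segments
    return [int(torch.randint(seg_len, (1,))) + i * seg_len
            for i in range(num_segments)]

# e.g. a 90-frame video: one frame from each third of the video.
indices = sample_segments(90, num_segments=3)

# Aggregation: per-snippet class scores are averaged into a consensus.
snippet_scores = torch.randn(3, 400)       # placeholder, 400 classes
video_score = snippet_scores.mean(dim=0)   # segmental consensus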

Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

A new Two-Stream Inflated 3D ConvNet (I3D), based on inflating 2D ConvNets to 3D, is introduced; I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics.
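The inflation trick can be sketched as follows: a pretrained 2D kernel is repeated along a new temporal axis and rescaled, so that on a temporally constant input the 3D filter reproduces the 2D activations. A simplified version for a single convolution:

import torch
import torch.nn as nn

def inflate_conv2d(conv2d, time_dim=3):
    # Repeat the 2D kernel time_dim times along a new temporal axis and
    # divide by time_dim, preserving activations on static ("boring") video.
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_dim, *conv2d.kernel_size),
                       padding=(time_dim // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    w2d = conv2d.weight.data                             # (out, in, kH, kW)
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.data.copy_(w3d)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

conv3d = inflate_conv2d(nn.Conv2d(3, 64, kernel_size=7, padding=3))
out = conv3d(torch.randn(1, 3, 16, 224, 224))            # (N, C, T, H, W)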

Multi-Label Learning from Single Positive Labels

This work considers the hardest version of this problem, where annotators provide only one relevant label for each image, and shows that in some cases it is possible to approach the performance of fully labeled classifiers despite training with significantly fewer confirmed labels.
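In the single-positive setting, the naive starting point is to assume every unobserved label is negative; a minimal sketch of that baseline loss follows (the paper's contribution goes beyond it):

import torch
import torch.nn.functional as F

def assume_negative_loss(logits, positive_idx):
    # Only one positive label is confirmed per image; all other labels
    # are treated as negatives, which is noisy but simple.
    targets = torch.zeros_like(logits)
    targets[torch.arange(logits.size(0)), positive_idx] = 1.0
    return F.binary_cross_entropy_with_logits(logits, targets)

# Batch of 4 images, 20 candidate labels, one confirmed positive each.
logits = torch.randn(4, 20, requires_grad=True)
loss = assume_negative_loss(logits, torch.tensor([2, 7, 7, 19]))
loss.backward()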