Do less and achieve more: Training CNNs for action recognition utilizing action images from the Web

  • Shugao Ma, Sarah Adel Bargal, Jianming Zhang, Leonid Sigal, Stan Sclaroff
  • Pattern Recognit.

Attention Transfer from Web Images for Video Recognition

This work proposes a novel approach to transfer knowledge from the image domain to the video domain, and designs a novel Siamese EnergyNet structure that learns energy functions on the attention maps by jointly optimizing two loss functions, such that the attention map corresponding to a ground truth concept has higher energy.

Deep Image-to-Video Adaptation and Fusion Networks for Action Recognition

This paper proposes a novel method, named Deep Image-to-Video Adaptation and Fusion Networks (DIVAFN), to enhance action recognition in videos by transferring knowledge from images using video keyframes as a bridge; the method outperforms some state-of-the-art domain adaptation and action recognition methods.

Learning from Weakly-labeled Web Videos via Exploring Sub-Concepts

This work proposes to provide fine-grained supervision signals by defining the concept of Sub-Pseudo Label (SPL), which spans a new, meaningful "middle ground" label space constructed by extrapolating the original weak labels used during video querying and the prior knowledge distilled from a teacher model.

Exploiting Images for Video Recognition: Heterogeneous Feature Augmentation via Symmetric Adversarial Learning

A novel symmetric adversarial learning approach is proposed for heterogeneous image-to-video adaptation; it augments deep image and video features by learning domain-invariant representations of source images and target videos with superior transformability and distinguishability.

Deep CNN Object Features for Improved Action Recognition in Low Quality Videos

Experimental results on low quality versions of two publicly available datasets showed that using CNN object features together with conventional shape and motion features can greatly improve the performance of action recognition in low quality videos.

HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization

On HACS Segments, the state-of-the-art methods of action proposal generation and action localization are evaluated, and the new challenges posed by the dense temporal annotations are highlighted.

Unsupervised Learning of Human Action Categories in Still Images with Deep Representations

A novel method is proposed that alternately updates convolutional neural network parameters and the surrogate training dataset in an iterative manner, addressing the difficulty of learning unsupervised discriminative deep representations.

CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video

CycDA is proposed, a cycle-based approach for unsupervised image-to-video domain adaptation that leverages the joint spatial information in images and videos and trains an independent spatio-temporal model to bridge the modality gap.

Still Image Action Recognition by Predicting Spatial-Temporal Pixel Evolution

  • M. Safaei, H. Foroosh
  • Computer Science
    2019 IEEE Winter Conference on Applications of Computer Vision (WACV)
  • 2019
A novel image representation domain, Ranked Saliency Map and Predicted Optical Flow (Rank_SM-POF for short), is proposed; it captures both actor appearance and the future movement patterns of the actor by capturing the temporal ordering of each pixel, training a linear ranking machine on the predicted tensor of the spatial-temporal representation of images.

Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images

This work proposes a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output that are used to train action recognition models with long short-term memory networks.

Web-Based Classifiers for Human Action Recognition

The idea is to use images collected from the Web to learn representations of actions and leverage this knowledge to automatically annotate actions in videos, and to use “ordered pose pairs” (OPP) for encoding the temporal ordering of poses in the action model.

Two-Stream Convolutional Networks for Action Recognition in Videos

This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.

What If We Do Not have Multiple Videos of the Same Action? — Video Action Localization Using Web Images

  • Waqas Sultani, M. Shah
  • Computer Science
    2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2016
This paper tackles the problem of spatio-temporal action localization in a video, without assuming the availability of multiple videos or any prior annotations, and proposes to reconstruct action proposals in the video by leveraging the action proposals in images.

Learning realistic human actions from movies

A new method for video classification that builds upon and extends several recent ideas, including local space-time features, space-time pyramids and multi-channel non-linear SVMs, is presented and shown to improve state-of-the-art results on the standard KTH action dataset.

Video Annotation via Image Groups from the Web

A novel Group-based Domain Adaptation (GDA) learning framework to leverage different groups of knowledge queried from the Web image search engine to consumer videos (target domain) and demonstrates the effectiveness of leveraging grouped knowledge gained from Web images for video annotation.

3D Convolutional Neural Networks for Human Action Recognition

A novel 3D CNN model for action recognition that extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames.

Large-Scale Video Classification with Convolutional Neural Networks

This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.

Watching Unlabeled Video Helps Learn New Human Actions from Very Few Labeled Snapshots

This work presents an approach to learn action categories from static images that leverages prior observations of generic human motion to augment its training process; it enhances a state-of-the-art technique when very few labeled training examples are available.

Beyond short snippets: Deep networks for video classification

This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.