Weakly Supervised Energy-Based Learning for Action Segmentation

@inproceedings{Li2019WeaklySE,
  title={Weakly Supervised Energy-Based Learning for Action Segmentation},
  author={Jun Li and Peng Lei and Sinisa Todorovic},
  booktitle={2019 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2019},
  pages={6242--6250}
}
This paper is about labeling video frames with action classes under weak supervision in training, where we have access to a temporal ordering of actions, but their start and end frames in training videos are unknown. Following prior work, we use an HMM grounded on a Gated Recurrent Unit (GRU) for frame labeling. Our key contribution is a new constrained discriminative forward loss (CDFL) that we use for training the HMM and GRU under weak supervision. While prior work typically estimates the…
Set-Constrained Viterbi for Set-Supervised Action Segmentation
  • J. Li, S. Todorovic
  • Computer Science
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
TLDR
This paper specifies an HMM that accounts for co-occurrences of action classes and their temporal lengths, explicitly trains the HMM on a Viterbi-based loss, and introduces a new regularization of feature affinities between training videos that share the same action classes.
Set-Constrained Viterbi for Set-Supervised Action Segmentation
This paper is about weakly supervised action segmentation, where the ground truth specifies only a set of actions present in a training video, but not their true temporal ordering. Prior work…
Anchor-Constrained Viterbi for Set-Supervised Action Segmentation
TLDR
A Hidden Markov Model grounded on a multilayer perceptron (MLP) is used to label video frames, and thus a pseudo-ground truth is generated for the subsequent pseudo-supervised training of action segmentation under weak supervision in training.
Action Shuffle Alternating Learning for Unsupervised Action Segmentation
TLDR
This paper addresses unsupervised action segmentation with a new self-supervised learning (SSL) of a feature embedding that accounts for both frame and action-level structure of videos.
SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation
TLDR
This work assumes that, for each training video, only the list of actions that occur in the video is given, but not when, how often, or in which order they occur, and proposes an approach that can be trained end-to-end on such data.
Weakly Supervised Action Segmentation Using Mutual Consistency
TLDR
This paper proposes a new approach for weakly supervised action segmentation based on a two-branch network that achieves state-of-the-art results for action segmentation and action alignment while being fully differentiable and faster to train since it does not require a costly alignment step during training.
Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing
TLDR
CAP is developed, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
Temporal Action Segmentation from Timestamp Supervision
TLDR
This paper uses the model output and the annotated timestamps to generate frame-wise labels by detecting the action changes, and introduces a confidence loss that forces the predicted probabilities to monotonically decrease as the distance to the timestamp increases.
Temporal Action Segmentation with High-level Complex Activity Labels
TLDR
A novel action discovery framework that automatically discovers constituent actions in videos with the activity classification task that is able to generalize the Hungarian matching settings from the current video and activity level to the global level.
A Cluster-Based Method for Action Segmentation Using Spatio-Temporal and Positional Encoded Embeddings
TLDR
This work proposes a novel action segmentation method that requires no prior video analysis and no annotated data, and produces competitive results on Breakfast and Inria Instructional Videos dataset benchmarks.

References

SHOWING 1-10 OF 26 REFERENCES
Weakly Supervised Action Learning with RNN Based Fine-to-Coarse Modeling
TLDR
A combination of a discriminative representation of subactions, modeled by a recurrent neural network, and a coarse probabilistic model to allow for a temporal alignment and inference over long sequences of human actions is proposed.
Connectionist Temporal Modeling for Weakly Supervised Action Labeling
TLDR
The Extended Connectionist Temporal Classification (ECTC) framework is introduced to efficiently evaluate all possible alignments via dynamic programming and explicitly enforce their consistency with frame-to-frame visual similarities.
Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment
  • Li Ding, Chenliang Xu
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR
A novel action modeling framework is proposed, which consists of a new temporal convolutional network, named Temporal Convolutional Feature Pyramid Network (TCFPN), for predicting frame-wise action labels, and a novel training strategy for weakly-supervised sequence modeling, named Iterative Soft Boundary Assignment (ISBA), to align action sequences and update the network in an iterative fashion.
Action Sets: Weakly Supervised Action Segmentation Without Ordering Constraints
TLDR
This work introduces a system that automatically learns to temporally segment and label actions in a video, where the only supervision used is action sets.
Weakly Supervised Action Labeling in Videos under Ordering Constraints
TLDR
It is shown that the action label assignment can be determined together with learning a classifier for each action in a discriminative manner and evaluated on a new and challenging dataset of 937 video clips.
Weakly supervised learning of actions from transcripts
TLDR
It is shown that the proposed system is able to align the scripted actions with the video data, that the learned models localize and classify actions in the datasets, and that they outperform any current state-of-the-art approach for aligning transcripts with video data.
Temporal Deformable Residual Networks for Action Segmentation in Videos
  • Peng Lei, S. Todorovic
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR
A new model - temporal deformable residual network (TDRN) - aimed at analyzing video intervals at multiple temporal scales for labeling video frames demonstrates that TDRN outperforms the state of the art in frame-wise segmentation accuracy, segmental edit score, and segmental overlap F1 score.
D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation
TLDR
The proposed Discriminative Differentiable Dynamic Time Warping (D3TW) innovatively solves sequence alignment with discriminative modeling and end-to-end training, which substantially improves the performance in weakly supervised action alignment and segmentation tasks.
Temporal Convolutional Networks for Action Segmentation and Detection
TLDR
A class of temporal models that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection, which are capable of capturing action compositions, segment durations, and long-range dependencies, and are over a magnitude faster to train than competing LSTM-based Recurrent Neural Networks.
NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning
TLDR
This work proposes a novel learning algorithm with a Viterbi-based loss that allows for online and incremental learning of weakly annotated video data and shows that explicit context and length modeling leads to huge improvements in video segmentation and labeling tasks.