Alleviating Over-segmentation Errors by Detecting Action Boundaries

@article{Ishikawa2021AlleviatingOE,
  title={Alleviating Over-segmentation Errors by Detecting Action Boundaries},
  author={Yuchi Ishikawa and Seito Kasai and Yoshimitsu Aoki and Hirokatsu Kataoka},
  journal={2021 IEEE Winter Conference on Applications of Computer Vision (WACV)},
  year={2021},
  pages={2321-2330}
}
We propose an effective framework for the temporal action segmentation task, namely an Action Segment Refinement Framework (ASRF). Our model architecture consists of a long-term feature extractor and two branches: the Action Segmentation Branch (ASB) and the Boundary Regression Branch (BRB). The long-term feature extractor provides shared features for the two branches with a wide temporal receptive field. The ASB classifies video frames with action classes, while the BRB regresses the action… 
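
The abstract above is cut off before it explains how the two branches interact. As a rough illustration only, the sketch below assumes one plausible combination: frame-wise labels (standing in for ASB output) are split at peaks of a per-frame boundary probability (standing in for BRB output), and each resulting segment is relabeled with its majority class. The function name `refine_with_boundaries`, the `threshold` parameter, and the peak rule are hypothetical and are not taken from the paper.

```python
from collections import Counter


def refine_with_boundaries(frame_labels, boundary_probs, threshold=0.5):
    """Relabel each segment between detected boundaries with its majority class.

    Assumption (not from the source): a boundary is a frame whose probability
    exceeds `threshold` and is a local maximum of the probability sequence.
    """
    if len(frame_labels) != len(boundary_probs):
        raise ValueError("frame_labels and boundary_probs must have equal length")
    n = len(frame_labels)
    if n == 0:
        return []

    # Collect the start index of every segment; frame 0 always starts one.
    starts = [0]
    for t in range(1, n):
        p = boundary_probs[t]
        left = boundary_probs[t - 1]
        right = boundary_probs[t + 1] if t + 1 < n else float("-inf")
        if p > threshold and p >= left and p > right:
            starts.append(t)

    # Majority vote inside each segment removes short spurious label flips.
    refined = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else n
        majority = Counter(frame_labels[start:end]).most_common(1)[0][0]
        refined.extend([majority] * (end - start))
    return refined


# Hypothetical toy input: two true segments with one mislabeled frame in each.
labels = ["a", "a", "b", "a", "a", "b", "b", "a", "b", "b"]
probs = [0.0, 0.1, 0.1, 0.0, 0.2, 0.9, 0.3, 0.1, 0.0, 0.0]
refined_labels = refine_with_boundaries(labels, probs)
```

In this toy input, the isolated `"b"` at frame 2 and the isolated `"a"` at frame 7 are absorbed by their surrounding segments, which is the kind of over-segmentation error the title refers to.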

Refining Action Segmentation with Hierarchical Video Representations
TLDR
HASR can be plugged into various action segmentation models (MS-TCN, SSTDA, ASRF), and improve the performance of state-of-the-art models based on three challenging datasets (GTEA, 50Salads, and Breakfast).
ASFormer: Transformer for Action Segmentation
TLDR
An efficient Transformer-based model for the action segmentation task, named ASFormer, with three distinctive characteristics, which constrains the hypothesis space within a reliable scope and helps the action segmentation task learn a proper target function from small training sets.
Temporal Action Segmentation with High-level Complex Activity Labels
TLDR
A novel action discovery framework that automatically discovers constituent actions in videos with the activity classification task, and that can generalize the Hungarian matching settings from the current video and activity level to the global level.
Temporal Action Segmentation from Timestamp Supervision
TLDR
This paper uses the model output and the annotated timestamps to generate frame-wise labels by detecting the action changes, and introduces a confidence loss that forces the predicted probabilities to monotonically decrease as the distance to the timestamp increases.
Coarse to Fine Multi-Resolution Temporal Convolutional Network
TLDR
A novel temporal encoder-decoder to tackle the problem of sequence fragmentation by following a coarse-to-fine structure with an implicit ensemble of multiple temporal resolutions, which produces smoother segmentations that are more accurate and better calibrated, bypassing the need for additional refinement modules.
Refining Action Segmentation with Hierarchical Video Representations -Supplementary Material-
In this supplementary material, we show additional qualitative results that could not be shown on the original manuscript due to the page limit. In addition, we compare our method with Graph-based
Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos
TLDR
This paper reformulates the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics, and proposes a prompt-based framework, Bridge-Prompt, to model the semantics across adjacent actions, so that it simultaneously exploits both out-of-context and contextual information from a series of ordinal actions in instructional videos.
FIFA: Fast Inference Approximation for Action Segmentation
TLDR
FIFA is a general approach that can replace exact inference, improving its speed by more than 5 times while maintaining its performance, and achieves state-of-the-art results for most metrics on two action segmentation datasets.
On the Pitfalls of Batch Normalization for End-to-End Video Learning: A Study on Surgical Workflow Analysis
TLDR
It is argued that BN’s properties create major obstacles for training CNNs and temporal models end to end in video tasks, and it is shown that even simple, end-to-end CNN-LSTMs can outperform the state of the art when CNNs without BN are used.
Overview of Tencent Multi-modal Ads Video Understanding
TLDR
An overview of the video structuring task in the multi-modal Ads Video Understanding Challenge is presented, including the background of ads videos, an elaborate description of this task, the proposed dataset, the evaluation protocol, and the baseline model.

References

SHOWING 1-10 OF 51 REFERENCES
Temporal Action Detection with Structured Segment Networks
TLDR
The structured segment network (SSN) is presented, a novel framework which models the temporal structure of each action instance via a structured temporal pyramid and introduces a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness.
MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation
TLDR
A multi-stage architecture for the temporal action segmentation task that achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
Cascaded Boundary Regression for Temporal Action Detection
TLDR
A two-stage temporal action detection pipeline with a Cascaded Boundary Regression (CBR) model, which uses temporal coordinate regression to refine the temporal boundaries of the sliding windows to achieve state-of-the-art performance on both datasets.
Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment
  • Li Ding, Chenliang Xu
  • Computer Science
    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR
A novel action modeling framework is proposed, which consists of a new temporal convolutional network, named Temporal Convolutional Feature Pyramid Network (TCFPN), for predicting frame-wise action labels, and a novel training strategy for weakly-supervised sequence modeling, named Iterative Soft Boundary Assignment (ISBA), to align action sequences and update the network in an iterative fashion.
Action Segmentation with Mixed Temporal Domain Adaptation
TLDR
Mixed Temporal Domain Adaptation is proposed to jointly align frame- and video-level embedded feature spaces across domains, and further integrate with the domain attention mechanism to focus on aligning the frame-level features with higher domain discrepancy, leading to more effective domain adaptation.
Segmental Spatiotemporal CNNs for Fine-Grained Action Segmentation
TLDR
This work proposes a model for action segmentation which combines low-level spatiotemporal features with a high-level segmental classifier and introduces an efficient constrained segmental inference algorithm for this model that is orders of magnitude faster than the current approach.
TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals
TLDR
A novel Temporal Unit Regression Network (TURN) model, which jointly predicts action proposals and refines the temporal boundaries by temporal coordinate regression, and outperforms state-of-the-art methods on THUMOS-14 and ActivityNet datasets.
Single Shot Temporal Action Detection
TLDR
This work proposes a novel Single Shot Action Detector (SSAD) network based on 1D temporal convolutional layers to skip the proposal generation step by directly detecting action instances in untrimmed video, and empirically investigates input feature types and fusion strategies to further improve detection accuracy.
Action Segmentation With Joint Self-Supervised Temporal Domain Adaptation
TLDR
Self-Supervised Temporal Domain Adaptation (SSTDA), which contains two self-supervised auxiliary tasks (binary and sequential domain prediction) to jointly align cross-domain feature spaces embedded with local and global temporal dynamics, achieving better performance than other Domain Adaptation (DA) approaches.
Improving Action Segmentation via Graph-Based Temporal Reasoning
TLDR
A network module called Graph-based Temporal Reasoning Module (GTRM) that can be built on top of existing action segmentation models to learn the relation of multiple action segments in various time spans is proposed.