Clockwork Convnets for Video Semantic Segmentation

@inproceedings{Shelhamer2016ClockworkCF,
  title={Clockwork Convnets for Video Semantic Segmentation},
  author={Evan Shelhamer and Kate Rakelly and Judy Hoffman and Trevor Darrell},
  booktitle={ECCV Workshops},
  year={2016}
}
Recent years have seen tremendous progress in still-image segmentation; however, the naive application of these state-of-the-art algorithms to every video frame requires considerable computation and ignores the temporal continuity inherent in video. We propose a video recognition framework that relies on two key observations: 1) while pixels may change rapidly from frame to frame, the semantic content of a scene evolves more slowly, and 2) execution can be viewed as an aspect of architecture… 
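The two observations above amount to running the expensive, semantically deep part of the network on a slower clock than the cheap, shallow part. A minimal sketch of that fixed-schedule idea, with hypothetical `stage1`/`stage2` functions standing in for the shallow and deep layers of a real segmentation network:

```python
import numpy as np

def stage1(frame):
    # Stand-in for the shallow layers: cheap, re-run on every frame.
    return frame * 0.5

def stage2(features):
    # Stand-in for the deep layers: expensive, but their semantic
    # output changes slowly across neighboring frames.
    return features.sum()

def clockwork_inference(frames, period=3):
    """Run stage2 only every `period` frames and reuse its cached
    output in between -- the fixed-schedule variant of clockwork
    execution (the paper also explores adaptive schedules)."""
    outputs, cached = [], None
    for t, frame in enumerate(frames):
        feats = stage1(frame)              # always recomputed
        if t % period == 0 or cached is None:
            cached = stage2(feats)         # refreshed on the clock tick
        outputs.append(cached)
    return outputs

frames = [np.ones((4, 4)) * i for i in range(6)]
print(clockwork_inference(frames, period=3))
```

With `period=3`, frames 1-2 and 4-5 reuse the deep output computed at frames 0 and 3, so only a third of the expensive stage is ever executed.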
Distortion-Aware Network Pruning and Feature Reuse for Real-time Video Segmentation
TLDR
This paper proposes a novel framework to speed up any architecture with skip-connections for real-time vision tasks by exploiting the temporal locality in videos, and validates the Spatial-Temporal Mask Generator (STMG) on video semantic segmentation benchmarks with multiple backbone networks.
Low-Latency Video Semantic Segmentation
TLDR
A framework for video semantic segmentation is developed, which incorporates two novel components: a feature propagation module that adaptively fuses features over time via spatially variant convolution, thus reducing the cost of per-frame computation; and an adaptive scheduler that dynamically allocates computation based on accuracy prediction.
Efficient Semantic Video Segmentation with Per-frame Inference
TLDR
This work performs efficient semantic video segmentation in a per-frame fashion at inference time, while explicitly treating temporal consistency among frames as an extra constraint during training, thereby embedding temporal consistency into the segmentation network.
Architecture Search of Dynamic Cells for Semantic Video Segmentation
TLDR
This work proposes a neural architecture search solution, where the choice of operations together with their sequential arrangement is predicted by a separate neural network, and shows that such generalisation leads to stable and accurate results across common benchmarks, such as the CityScapes and CamVid datasets.
Learning Dynamic Network Using a Reuse Gate Function in Semi-supervised Video Object Segmentation
TLDR
A novel dynamic network is proposed that estimates change across frames and decides which path to take – computing the full network or reusing the previous frame’s features – depending on the expected similarity.
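The gating decision described above can be sketched with a simple thresholded change estimate standing in for the learned reuse gate; `expensive_segment` is a hypothetical placeholder for the full segmentation network:

```python
import numpy as np

def expensive_segment(frame):
    # Hypothetical stand-in for the full segmentation network.
    return (frame > 0.5).astype(np.uint8)

def gated_stream(frames, tau=0.05):
    """For each frame, a cheap change estimate gates between running
    the full network and reusing the previous prediction. This is a
    sketch under a fixed threshold; the paper learns the gate instead."""
    preds, prev_frame, prev_pred = [], None, None
    for frame in frames:
        change = 1.0 if prev_frame is None else float(np.abs(frame - prev_frame).mean())
        if change > tau:
            prev_pred = expensive_segment(frame)   # full path
        preds.append(prev_pred)                    # otherwise: reuse path
        prev_frame = frame
    return preds
```

When consecutive frames are near-identical the expensive path is skipped entirely, which is where the speed-up comes from.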
CBinfer: Exploiting Frame-to-Frame Locality for Faster Convolutional Network Inference on Video Streams
  • L. Cavigelli, L. Benini
  • Computer Science
    IEEE Transactions on Circuits and Systems for Video Technology
  • 2020
TLDR
This work adopts an orthogonal viewpoint and proposes a novel algorithm exploiting the spatio-temporal sparsity of pixel changes, which results in an average speed-up of 9.1X over cuDNN on the Tegra X2 platform at negligible accuracy loss and lower power consumption.
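The change-sparsity idea above can be illustrated in a few lines: mask the pixels whose input changed beyond a threshold and recompute an operation only there, copying cached results elsewhere. `conv_point` is a hypothetical per-pixel stand-in; the actual CBinfer applies this per convolutional layer with custom GPU kernels.

```python
import numpy as np

def conv_point(frame, y, x):
    # Hypothetical per-pixel operation standing in for a conv layer's
    # work at location (y, x).
    return float(frame[y, x]) * 2.0

def changebased_inference(prev_frame, prev_out, frame, tau=0.1):
    """Recompute only where the input changed by more than tau;
    elsewhere reuse the cached output from the previous frame."""
    mask = np.abs(frame - prev_frame) > tau     # changed-pixel mask
    out = prev_out.copy()                       # start from cached results
    ys, xs = np.nonzero(mask)
    for y, x in zip(ys, xs):
        out[y, x] = conv_point(frame, y, x)     # recompute changed pixels only
    return out, mask.mean()                     # output + fraction recomputed
```

On typical video streams the recomputed fraction is small, which is the source of the reported speed-up.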
Efficient Video Semantic Segmentation with Labels Propagation and Refinement
TLDR
The proposed Efficient Video Segmentation (EVS) pipeline achieves accuracy levels competitive to the existing real-time methods for semantic image segmentation (mIoU above 60%), while achieving much higher frame rates.
Temporally Distributed Networks for Fast Video Semantic Segmentation
We present TDNet, a temporally distributed network designed for fast and accurate video semantic segmentation. We observe that features extracted from a certain high-level layer of a deep CNN can be…
FASSVid: Fast and Accurate Semantic Segmentation for Video Sequences
TLDR
This paper leverages a previous input frame as well as the previous output of the network to enhance the prediction accuracy of the current input frame of the video stream, and proposes a network, entitled FASSVid, which improves the mIoU accuracy performance over a standard non-sequential baseline model.

References

SHOWING 1-10 OF 37 REFERENCES
Supervoxel-Consistent Foreground Propagation in Video
TLDR
This work proposes a higher order supervoxel label consistency potential for semi-supervised foreground segmentation, leveraging bottom-up supervoxels to guide its estimates towards long-range coherent regions.
Efficient hierarchical graph-based video segmentation
TLDR
An efficient and scalable technique for spatiotemporal segmentation of long video sequences using a hierarchical graph-based algorithm that generates high quality segmentations, which are temporally coherent with stable region boundaries, and allows subsequent applications to choose from varying levels of granularity.
Weakly Supervised Multiclass Video Segmentation
TLDR
This paper presents a novel nearest neighbor-based label transfer scheme for weakly supervised video segmentation, which finds a semantically meaningful label for every pixel in a video.
Evaluation of super-voxel methods for early video processing
TLDR
Five supervoxel algorithms are studied in the context of what makes a good supervoxel: namely, spatiotemporal uniformity, object/region boundary detection, region compression, and parsimony, leading to conclusive evidence that the hierarchical graph-based and segmentation-by-weighted-aggregation methods perform best, and almost equally well, on nearly all the metrics.
Long-term recurrent convolutional networks for visual recognition and description
TLDR
A novel recurrent convolutional architecture suitable for large-scale visual learning is presented, which is end-to-end trainable; such models are shown to have distinct advantages over state-of-the-art models for recognition or generation that are separately defined and/or optimized.
Convolutional neural networks at constrained time cost
  • Kaiming He, Jian Sun
  • Computer Science
    2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2015
TLDR
This paper investigates the accuracy of CNNs under constrained time cost, and presents an architecture that achieves very competitive accuracy in the ImageNet dataset, yet is 20% faster than “AlexNet” [14] (16.0% top-5 error, 10-view test).
Fast Object Segmentation in Unconstrained Video
TLDR
This method is fast, fully automatic, and makes minimal assumptions about the video, which enables handling essentially unconstrained settings, including rapidly moving background, arbitrary object motion and appearance, and non-rigid deformations and articulations.
Speeding up Convolutional Neural Networks with Low Rank Expansions
TLDR
Two simple schemes for drastically speeding up convolutional neural networks are presented, achieved by exploiting cross-channel or filter redundancy to construct a low rank basis of filters that are rank-1 in the spatial domain.
Large-Scale Video Classification with Convolutional Neural Networks
TLDR
This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Learning object class detectors from weakly annotated video
TLDR
It is shown that training from a combination of weakly annotated videos and fully annotated still images using domain adaptation improves the performance of a detector trained from still images alone.