SlowFast Networks for Video Recognition
- Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He
- Computer Science, IEEE International Conference on Computer Vision
- 10 December 2018
This work presents SlowFast networks for video recognition, which achieve strong performance for both action classification and detection in video; large improvements are pin-pointed as contributions of the SlowFast concept.
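A minimal sketch of the two-pathway idea in PyTorch, assuming hypothetical channel widths, strides, and kernel sizes; this is not the paper's exact architecture. A Slow pathway processes frames at a low rate with many channels, a lightweight Fast pathway processes the full frame rate with few channels, and a lateral connection fuses Fast features into the Slow pathway.

```python
# Hypothetical SlowFast-style stem: sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

class TinySlowFast(nn.Module):
    def __init__(self, alpha=8, beta=8, slow_ch=64):
        super().__init__()
        self.alpha = alpha                      # temporal stride between pathways
        fast_ch = slow_ch // beta               # Fast pathway is beta x thinner
        self.slow = nn.Conv3d(3, slow_ch, kernel_size=(1, 7, 7),
                              stride=(1, 2, 2), padding=(0, 3, 3))
        self.fast = nn.Conv3d(3, fast_ch, kernel_size=(5, 7, 7),
                              stride=(1, 2, 2), padding=(2, 3, 3))
        # Lateral connection: time-strided conv maps Fast features to the
        # Slow pathway's temporal resolution before concatenation.
        self.lateral = nn.Conv3d(fast_ch, 2 * fast_ch,
                                 kernel_size=(7, 1, 1),
                                 stride=(alpha, 1, 1), padding=(3, 0, 0))

    def forward(self, clip):                    # clip: (B, 3, T, H, W)
        slow_in = clip[:, :, ::self.alpha]      # low-frame-rate input
        s = self.slow(slow_in)
        f = self.fast(clip)                     # full-frame-rate input
        s = torch.cat([s, self.lateral(f)], dim=1)   # fuse Fast -> Slow
        return s, f
```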
A ConvNet for the 2020s
- Zhuang Liu, Hanzi Mao, Chaozheng Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie
- Computer Science, Computer Vision and Pattern Recognition
- 10 January 2022
This work gradually “modernizes” a standard ResNet toward the design of a vision Transformer and discovers several key components that contribute to the performance difference along the way, leading to a family of pure ConvNet models dubbed ConvNeXt.
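A rough sketch of a ConvNeXt-style block in PyTorch, simplified for illustration (layer scale and stochastic depth from the paper are omitted): a 7×7 depthwise convolution, LayerNorm applied in channels-last order, and an inverted-bottleneck MLP with GELU, wrapped in a residual connection.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # expand 4x (inverted bottleneck)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)   # project back

    def forward(self, x):                        # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # channels-last for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to channels-first
        return shortcut + x
```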
Convolutional Two-Stream Network Fusion for Video Action Recognition
- Christoph Feichtenhofer, A. Pinz, Andrew Zisserman
- Computer Science, Computer Vision and Pattern Recognition
- 22 April 2016
A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed and evaluated on standard benchmarks, where it achieves state-of-the-art results.
3D Human Pose Estimation in Video With Temporal Convolutions and Semi-Supervised Training
- Dario Pavllo, Christoph Feichtenhofer, David Grangier, Michael Auli
- Computer Science, Computer Vision and Pattern Recognition
- 28 November 2018
In this work, we demonstrate that 3D poses in video can be effectively estimated with a fully convolutional model based on dilated temporal convolutions over 2D keypoints. We also introduce…
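An illustrative sketch of the lifting idea (not the released VideoPose3D code): a small fully convolutional model maps a sequence of 2D keypoints to 3D poses using dilated 1D temporal convolutions; the joint count, channel width, and dilation schedule here are hypothetical.

```python
import torch
import torch.nn as nn

class TemporalLifter(nn.Module):
    def __init__(self, num_joints=17, channels=256):
        super().__init__()
        self.expand = nn.Conv1d(2 * num_joints, channels, kernel_size=3, padding=1)
        # Increasing dilation grows the temporal receptive field exponentially.
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=3, dilation=d, padding=d),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            )
            for d in (1, 3, 9)
        ])
        self.head = nn.Conv1d(channels, 3 * num_joints, kernel_size=1)

    def forward(self, kp2d):                     # kp2d: (B, T, J, 2) 2D keypoints
        b, t, j, _ = kp2d.shape
        x = kp2d.reshape(b, t, 2 * j).transpose(1, 2)    # (B, 2J, T)
        x = torch.relu(self.expand(x))
        for blk in self.blocks:
            x = x + blk(x)                       # residual dilated conv blocks
        return self.head(x).transpose(1, 2).reshape(b, t, j, 3)  # (B, T, J, 3)
```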
X3D: Expanding Architectures for Efficient Video Recognition
- Christoph Feichtenhofer
- Computer Science, Computer Vision and Pattern Recognition
- 9 April 2020
This paper presents X3D, a family of efficient video networks that progressively expand a tiny 2D image classification architecture along multiple network axes (space, time, width, and depth), finding that networks with high spatiotemporal resolution can perform well while being extremely light in terms of network width and parameters.
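A schematic of the progressive-expansion procedure, for illustration only: at each step, exactly one axis of a tiny base model is expanded by a fixed factor, the candidates are trained under a matched compute budget, and the single expansion with the best accuracy is kept. The base configuration, expansion factors, and `train_and_eval` below are hypothetical placeholders, not the paper's values.

```python
from copy import deepcopy

# Hypothetical base model and per-axis expansion factors (illustrative values).
base = {"frames": 1, "frame_rate": 1, "resolution": 112,
        "width": 24, "depth": 10, "bottleneck": 2.25}
expand = {"frames": 2.0, "frame_rate": 2.0, "resolution": 1.3,
          "width": 2.0, "depth": 2.2, "bottleneck": 1.5}

def train_and_eval(config):
    # Placeholder: in practice, build and briefly train the network defined by
    # `config` at a fixed compute budget and return validation accuracy.
    return 0.0

def expand_once(config):
    candidates = {}
    for axis, factor in expand.items():
        trial = deepcopy(config)
        trial[axis] = trial[axis] * factor       # expand a single axis
        candidates[axis] = (train_and_eval(trial), trial)
    best_axis = max(candidates, key=lambda a: candidates[a][0])
    return candidates[best_axis][1]              # keep the best single expansion
```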
Multiscale Vision Transformers
- Haoqi Fan, Bo Xiong, Christoph Feichtenhofer
- Computer Science, IEEE International Conference on Computer Vision
- 22 April 2021
This fundamental architectural prior for modeling the dense nature of visual signals is evaluated on a variety of video recognition tasks, where it outperforms concurrent vision transformers that rely on large-scale external pre-training and are 5-10× more costly in computation and parameters.
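A simplified sketch of the pooling-attention idea behind multiscale transformers, in PyTorch: the key/value sequence is downsampled by strided pooling before attention, so later stages can trade resolution for channel capacity. This is single-head, operates on a flat token sequence, and uses a hypothetical stride; the real model pools over a 3D space-time grid with multiple heads.

```python
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    def __init__(self, dim, kv_stride=2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.pool = nn.AvgPool1d(kernel_size=kv_stride, stride=kv_stride)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                        # x: (B, N, C) tokens
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Pool keys/values to a coarser sequence length (N -> N / stride).
        k = self.pool(k.transpose(1, 2)).transpose(1, 2)
        v = self.pool(v.transpose(1, 2)).transpose(1, 2)
        attn = (q @ k.transpose(1, 2)) * self.scale
        out = attn.softmax(dim=-1) @ v           # (B, N, C)
        return self.proj(out)
```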
Spatiotemporal Residual Networks for Video Action Recognition
- Christoph Feichtenhofer, A. Pinz, R. Wildes
- Computer Science, NIPS
- 7 November 2016
The novel spatiotemporal ResNet is introduced and evaluated on two widely used action recognition benchmarks, where it exceeds the previous state-of-the-art.
Detect to Track and Track to Detect
- Christoph Feichtenhofer, A. Pinz, Andrew Zisserman
- Computer Science, IEEE International Conference on Computer Vision
- 11 October 2017
This paper sets up a ConvNet architecture for simultaneous detection and tracking, using a multi-task objective for frame-based object detection and across-frame track regression, and introduces correlation features that represent object co-occurrences across time to aid the ConvNet during tracking.
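A sketch of cross-frame correlation features, for illustration: each spatial position in frame t is correlated with positions in frame t+1 within a local (2d+1)×(2d+1) neighborhood, producing a displacement-indexed feature map that can aid track regression. The function name and displacement range are assumptions.

```python
import torch
import torch.nn.functional as F

def correlation_features(feat_t, feat_t1, max_disp=4):
    # feat_t, feat_t1: (B, C, H, W) feature maps from two consecutive frames
    b, c, h, w = feat_t.shape
    padded = F.pad(feat_t1, [max_disp] * 4)      # pad spatial dims for shifting
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            # Channel-wise dot product between frame t and the shifted frame t+1.
            out.append((feat_t * shifted).sum(dim=1, keepdim=True) / c)
    return torch.cat(out, dim=1)                 # (B, (2d+1)^2, H, W)
```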
Long-Term Feature Banks for Detailed Video Understanding
- Chao-Yuan Wu, Christoph Feichtenhofer, Haoqi Fan, Kaiming He, Philipp Krähenbühl, Ross B. Girshick
- Computer Science, Computer Vision and Pattern Recognition
- 12 December 2018
This paper proposes a long-term feature bank—supportive information extracted over the entire span of a video—to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds.
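An illustrative sketch of the feature-bank idea (not the released code): features from the current short clip attend over a bank of features precomputed across the whole video, and the attended long-term context is fused back into the clip features. The class name, dimensions, and fusion choice are hypothetical.

```python
import torch
import torch.nn as nn

class FeatureBankOperator(nn.Module):
    def __init__(self, dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, clip_feats, bank_feats):
        # clip_feats: (B, Nc, D) features from the current short clip
        # bank_feats: (B, Nb, D) long-term features spanning the whole video
        context, _ = self.attn(clip_feats, bank_feats, bank_feats)
        return self.fuse(torch.cat([clip_feats, context], dim=-1))
```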
Masked Feature Prediction for Self-Supervised Visual Pre-Training
- Chen Wei, Haoqi Fan, Saining Xie, Chaoxia Wu, A. Yuille, Christoph Feichtenhofer
- Computer Science, Computer Vision and Pattern Recognition
- 16 December 2021
This work presents Masked Feature Prediction (MaskFeat), a self-supervised pre-training approach for video models that randomly masks out a portion of the input sequence and then predicts the features of the masked regions, and finds that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, work particularly well in terms of both performance and efficiency.
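An illustrative sketch of the masked-feature objective (not the released code): a random fraction of patch tokens is masked, the encoder predicts a feature descriptor for every token, and a regression loss is applied only at the masked positions. Here `encoder` and `hog_of_patches` are hypothetical stand-ins for a video Transformer and a per-patch HOG extractor.

```python
import torch
import torch.nn.functional as F

def maskfeat_loss(encoder, hog_of_patches, patches, mask_ratio=0.4):
    # patches: (B, N, D) patch tokens; targets: (B, N, H) per-patch HOG features
    b, n, _ = patches.shape
    num_mask = int(mask_ratio * n)
    idx = torch.rand(b, n).argsort(dim=1)[:, :num_mask]  # random masked positions
    mask = torch.zeros(b, n, dtype=torch.bool)
    mask.scatter_(1, idx, True)

    masked_in = patches.clone()
    masked_in[mask] = 0.0                     # replace masked tokens (e.g. mask token)
    pred = encoder(masked_in)                 # (B, N, H) predicted features
    target = hog_of_patches(patches)          # (B, N, H) hand-crafted HOG targets
    return F.mse_loss(pred[mask], target[mask])   # loss only on masked positions
```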
...
...