NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification

@article{Lin2018NeXtVLADAE,
  title={NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification},
  author={Rongcheng Lin and Jing Xiao and Jianping Fan},
  journal={ArXiv},
  year={2018},
  volume={abs/1811.05014}
}
This paper introduces a fast and efficient network architecture, NeXtVLAD, to aggregate frame-level features into a compact feature vector for large-scale video classification. Briefly speaking, the basic idea is to decompose a high-dimensional feature into a group of relatively low-dimensional vectors with attention before applying NetVLAD aggregation over time. This NeXtVLAD approach turns out to be both effective and parameter-efficient in aggregating temporal information. In the 2nd YouTube…
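To make the aggregation concrete, below is a minimal PyTorch sketch of the idea: expand the frame feature, split it into groups, and apply attention-weighted, NetVLAD-style aggregation over time. The hyperparameter values, layer names, einsum-based aggregation, and the omission of batch normalization and dropout are illustrative simplifications, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeXtVLAD(nn.Module):
    """Sketch of NeXtVLAD-style aggregation (illustrative hyperparameters)."""
    def __init__(self, feature_dim=1024, clusters=64, groups=8, expansion=2):
        super().__init__()
        self.G, self.K = groups, clusters
        self.D = feature_dim * expansion               # expanded dimension
        self.group_dim = self.D // groups              # per-group feature size
        self.expand = nn.Linear(feature_dim, self.D)
        self.attention = nn.Linear(self.D, groups)     # per-group attention
        self.assign = nn.Linear(self.D, groups * clusters)  # soft cluster assignment
        self.centers = nn.Parameter(0.01 * torch.randn(clusters, self.group_dim))

    def forward(self, x):                              # x: (batch, frames, feature_dim)
        B, M, _ = x.shape
        x = self.expand(x)                             # (B, M, D)
        attn = torch.sigmoid(self.attention(x))        # (B, M, G)
        assign = F.softmax(self.assign(x).view(B, M, self.G, self.K), dim=-1)
        assign = assign * attn.unsqueeze(-1)           # attention-weighted assignment
        xg = x.view(B, M, self.G, self.group_dim)      # split into G group features
        vlad = torch.einsum('bmgk,bmgd->bkd', assign, xg)    # weighted feature sums
        a_sum = assign.sum(dim=(1, 2))                 # total weight per cluster (B, K)
        vlad = vlad - a_sum.unsqueeze(-1) * self.centers     # residuals to centers
        vlad = F.normalize(vlad, dim=-1)               # intra-normalization per cluster
        return vlad.flatten(1)                         # (B, K * group_dim)
```

With these illustrative settings the output is a 64 × 256 = 16,384-dimensional descriptor per video; the paper feeds the aggregated vector into further layers (dimension reduction and a classification head), which are out of scope here.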
Citations

BERT for Large-scale Video Segment Classification with Test-time Augmentation
TLDR: This paper presents an approach to the third YouTube-8M video understanding competition, which challenges participants to localize video-level labels at scale to the precise time in the video where the label actually occurs, and proposes test-time augmentation that shifts video frames one unit to the left or right.
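The shift-style test-time augmentation summarized above fits in a few lines. A hypothetical PyTorch fragment under stated assumptions: `model` maps batches of frame-feature sequences to logits, and circular rolling via `torch.roll` stands in for whatever boundary handling the authors actually used.

```python
import torch

def predict_with_tta(model, frames):
    """Average predictions over the original and one-frame-shifted sequences.

    frames: (batch, num_frames, feature_dim) tensor of frame-level features.
    """
    shifted_left = torch.roll(frames, shifts=-1, dims=1)   # frame t <- frame t+1
    shifted_right = torch.roll(frames, shifts=1, dims=1)   # frame t <- frame t-1
    with torch.no_grad():
        preds = [model(v) for v in (frames, shifted_left, shifted_right)]
    return torch.stack(preds).mean(dim=0)                  # average the three views
```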
Joint Learning of NNeXtVLAD, CNN and Context Gating for Micro-Video Venue Classification
TLDR: An end-to-end framework that jointly learns NNeXtVLAD, a CNN layer, and context gating for micro-video venue classification is built, which significantly outperforms the state-of-the-art baselines in terms of both Micro-F1 and Macro-F1 scores.
Efficient Video Classification Using Fewer Frames
TLDR: This work focuses on building compute-efficient video classification models that process fewer frames and hence require fewer FLOPs, and shows that in each of these cases a see-it-all teacher can be used to train a compute-efficient see-very-little student.
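The see-it-all teacher / see-very-little student setup lends itself to a short distillation sketch. This is a hedged illustration, not the paper's exact recipe: the function names, the uniform frame subsampling, and the soft-target BCE mixture are assumptions for a multi-label setting.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, frames, labels, keep_every=5, alpha=0.5):
    """One hypothetical training step: the teacher scores all frames,
    the student sees only every `keep_every`-th frame."""
    with torch.no_grad():
        teacher_logits = teacher(frames)                  # teacher sees all frames
    student_logits = student(frames[:, ::keep_every])     # student sees ~1/5 of them
    hard = F.binary_cross_entropy_with_logits(student_logits, labels.float())
    soft = F.binary_cross_entropy_with_logits(student_logits,
                                              torch.sigmoid(teacher_logits))
    return alpha * hard + (1 - alpha) * soft              # blend hard and soft targets
```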
A segment-level classification solution to the 3rd YouTube-8M Video Understanding Challenge
In the 3rd YouTube-8M Video Understanding Challenge, datasets with video-level and segment-level annotations were provided for classifying video segments. Based on the winning solutions of the previous…
Temporal Localization of Video Topics Using the YT8M Dataset: An Exploration
Due to the progress made in computing resources and artificial intelligence, applications in computer vision have gained a lot of traction over the past decade. One such application applies to video…
Noise Learning for Weakly Supervised Segment Classification in Video
This paper describes our solution for the 3rd YouTube-8M video understanding challenge. This year's challenge differs from the previous ones. Given a large-scale video dataset with…
Exploring the Consistency of Segment-level and Video-level Predictions for Improved Temporal Concept Localization in Videos
Compared with the previous video-level classification, the YouTube-8M video understanding challenge of 2019 mainly focuses on temporally localizing entities in videos. Specifically, …
Boosting Up Segment-level Video Classification Performance with Label Correlation and Reweighting
This paper introduces a solution to the 3rd YouTube-8M video understanding challenge. The main focus of the solution is to analyze the label dependencies of multi-label videos and explore this…
MOD: A Deep Mixture Model with Online Knowledge Distillation for Large Scale Video Temporal Concept Localization
TLDR: A deep mixture model with online knowledge distillation (MOD) for large-scale video temporal concept localization is presented and discussed; it ranked 3rd in the 3rd YouTube-8M Video Understanding Challenge.
A Multimodal Framework for Video Ads Understanding
TLDR: A multimodal system that improves the ability to perform structured analysis of advertising video content, achieving a score of 0.2470, a measure combining localization and prediction accuracy, and ranking fourth on the 2021 TAAC final leaderboard.

References

SHOWING 1-10 OF 37 REFERENCES
Learnable pooling with Context Gating for video classification
TLDR: This work explores combinations of learnable pooling techniques such as Soft Bag-of-words, Fisher Vectors, NetVLAD, GRU, and LSTM to aggregate video features over time, and introduces a learnable non-linear network unit, named Context Gating, aimed at modeling interdependencies between features.
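Context Gating has a compact closed form, Y = σ(WX + b) ∘ X: the features are re-weighted element-wise by a learned sigmoid gate. A minimal PyTorch sketch (the paper also describes a batch-normalized variant, omitted here):

```python
import torch
import torch.nn as nn

class ContextGating(nn.Module):
    """Y = sigmoid(W x + b) * x, an element-wise learned feature gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (..., dim)
        return torch.sigmoid(self.gate(x)) * x   # suppress or emphasize features
```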
Fusing Multi-Stream Deep Networks for Video Classification
TLDR: A multi-stream framework is proposed to fully utilize the rich multimodal information in videos, and it is demonstrated that the adaptive fusion method using the class relationship as a regularizer outperforms traditional alternatives that estimate the weights in a "free" fashion.
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
TLDR: A new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video and outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
Delving Deeper into Convolutional Networks for Learning Video Representations
TLDR: A variant of the GRU model is introduced that leverages convolution operations to enforce sparse connectivity of the model units and share parameters across input spatial locations, mitigating the effect of low-level percepts on human action recognition and video captioning tasks.
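The variant replaces the fully connected gate transforms of a standard GRU with convolutions, so parameters are shared across spatial locations of the feature maps. A sketch under assumptions (the kernel size, joint gate computation, and tensor layout are illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """GRU cell whose gates are convolutions over spatial feature maps."""
    def __init__(self, in_ch, hidden_ch, k=3):
        super().__init__()
        pad = k // 2
        # update (z) and reset (r) gates, computed jointly from [x, h]
        self.gates = nn.Conv2d(in_ch + hidden_ch, 2 * hidden_ch, k, padding=pad)
        self.cand = nn.Conv2d(in_ch + hidden_ch, hidden_ch, k, padding=pad)

    def forward(self, x, h):                     # x: (B,C,H,W), h: (B,hidden_ch,H,W)
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde         # convex update of the hidden state
```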
Large-Scale Video Classification with Convolutional Neural Networks
TLDR: This work studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multiresolution, foveated architecture as a promising way of speeding up training.
Beyond short snippets: Deep networks for video classification
TLDR: This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full-length videos.
YouTube-8M: A Large-Scale Video Classification Benchmark
TLDR: YouTube-8M is introduced, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video) annotated with a vocabulary of 4800 visual entities, and various (modest) classification models are trained on the dataset.
Convolutional Two-Stream Network Fusion for Video Action Recognition
TLDR: A new ConvNet architecture for spatiotemporal fusion of video snippets is proposed, and its performance is evaluated on standard benchmarks, where it achieves state-of-the-art results.
Action Recognition with Dynamic Image Networks
We introduce the concept of the dynamic image, a novel compact representation of videos useful for video analysis, particularly in combination with convolutional neural networks (CNNs). A dynamic image…
Appearance-and-Relation Networks for Video Classification
  • Limin Wang, Wei Li, Wen Li, L. Van Gool
  • Computer Science
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR: This paper presents a new architecture, termed the Appearance-and-Relation Network (ARTNet), to learn video representations in an end-to-end manner; it is constructed by stacking multiple generic building blocks, called SMART, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner.