VideoBERT: A Joint Model for Video and Language Representation Learning

@article{Sun2019VideoBERTAJ,
  title={VideoBERT: A Joint Model for Video and Language Representation Learning},
  author={Chen Sun and Austin Myers and Carl Vondrick and Kevin P. Murphy and Cordelia Schmid},
  journal={2019 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2019},
  pages={7463-7472}
}
  • Chen Sun, Austin Myers, Carl Vondrick, Kevin P. Murphy, Cordelia Schmid
  • Published 3 April 2019
  • Computer Science
  • 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube. [...] In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively.
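The key method amounts to turning both modalities into discrete tokens and training a single BERT-style masked model over the concatenated sequence. The sketch below (Python) illustrates only the tokenization and masking step, not the Transformer itself; the feature dimensionality, vocabulary size, masking rate, special-token names, and the use of scikit-learn's KMeans are toy assumptions for illustration, whereas the paper derives clip features from a pretrained video network and builds a much larger visual vocabulary with hierarchical k-means.

import numpy as np
from sklearn.cluster import KMeans

RNG = np.random.default_rng(0)
FEAT_DIM = 64           # toy dimensionality of per-clip video features (assumption)
NUM_VISUAL_WORDS = 256  # toy size of the visual vocabulary (assumption)
MASK_PROB = 0.15        # BERT-style masking rate

# 1) Build a "visual vocabulary" by clustering clip-level video features.
#    Here the features are random; in practice they come from a pretrained video model.
train_features = RNG.normal(size=(2000, FEAT_DIM))
visual_vocab = KMeans(n_clusters=NUM_VISUAL_WORDS, n_init=10, random_state=0)
visual_vocab.fit(train_features)

def video_to_tokens(clip_features):
    """Vector-quantize each clip feature to its nearest centroid id (a 'visual word')."""
    return [f"vis_{i}" for i in visual_vocab.predict(clip_features)]

def build_joint_sequence(asr_tokens, clip_features):
    """Concatenate linguistic (ASR) tokens and visual tokens into one input sequence."""
    return ["[CLS]", *asr_tokens, "[>]", *video_to_tokens(clip_features), "[SEP]"]

def mask_for_mlm(tokens):
    """Randomly mask tokens; a BERT-style model is trained to recover the masked targets."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if tok not in ("[CLS]", "[SEP]", "[>]") and RNG.random() < MASK_PROB:
            targets[i] = tok
            masked[i] = "[MASK]"
    return masked, targets

# Toy usage: one narrated segment paired with 5 video clips.
asr = "cut the chicken into small pieces".split()
clips = RNG.normal(size=(5, FEAT_DIM))
inputs, targets = mask_for_mlm(build_joint_sequence(asr, clips))
print(inputs)   # joint text+visual token sequence with [MASK] placeholders
print(targets)  # positions and original tokens the model must predict

In the full model, such masked joint sequences would be fed to a BERT encoder whose vocabulary is the union of linguistic and visual tokens.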

Citations

LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training
TLDR
This work proposes LAVA, a novel contrastive learning approach capable of learning joint language, audio, and video representations in a self-supervised manner, and demonstrates that LAVA performs competitively with current state-of-the-art self-supervised and weakly-supervised pre-training techniques on UCF-101 and HMDB-51 video action recognition while using a fraction of the unlabeled data.
Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling
TLDR
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that CLIPBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle.
Self-Supervised Learning for Videos: A Survey
TLDR
This survey provides a review of existing approaches to self-supervised learning focusing on the video domain, and summarizes these methods into three categories based on their learning objectives: 1) pretext tasks, 2) generative modeling, and 3) contrastive learning.
Flamingo: a Visual Language Model for Few-Shot Learning
TLDR
It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.
Temporal Contrastive Pretraining for Video Action Recognition
TLDR
A self-supervised method for video representation learning based on Contrastive Predictive Coding (CPC) is proposed and it is demonstrated experimentally that the representations learned by the network are useful for action recognition.
Video-Text Representation Learning via Differentiable Weak Temporal Alignment
TLDR
This paper proposes VT-TWINS, a novel multi-modal self-supervised framework to capture significant information from noisy and weakly correlated data using a variant of Dynamic Time Warping (DTW), and applies a contrastive learning scheme to learn feature representations on weakly correlated data.
Active Contrastive Learning of Audio-Visual Video Representations
TLDR
An active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items, which improves the quality of negative samples and improves performance on tasks where there is high mutual information in the data, e.g., video classification.
ActBERT: Learning Global-Local Video-Text Representations
  • Linchao Zhu, Yi Yang
  • Computer Science
  • 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
TLDR
This paper introduces ActBERT for self-supervised learning of joint video-text representations from unlabeled data, with an ENtangled Transformer block that encodes three sources of information: global actions, local regional objects, and linguistic descriptions.
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
TLDR
This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos, and outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
Learning Visual Representations with Caption Annotations
TLDR
It is argued that captioned images are easily crawlable and can be exploited to supervise the training of visual representations; the proposed hybrid models, with dedicated visual and textual encoders, show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks.
...

References

SHOWING 1-10 OF 41 REFERENCES
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper formulates an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
SoundNet: Learning Sound Representations from Unlabeled Video
TLDR
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
Anticipating Visual Representations from Unlabeled Video
TLDR
This work presents a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects, and applies recognition algorithms on the predicted representation to anticipate objects and actions.
Generating Videos with Scene Dynamics
TLDR
A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed; it can generate tiny videos up to a second long at full frame rate better than simple baselines.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Deep multi-scale video prediction beyond mean square error
TLDR
This work trains a convolutional network to generate future frames given an input sequence and proposes three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function.
Neural Baby Talk
TLDR
A novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image is introduced and reaches state-of-the-art on both COCO and Flickr30k datasets.
Weakly-Supervised Video Object Grounding from Text by Loss Weighting and Object Interaction
TLDR
A weakly-supervised video object grounding model that propagates the weak supervisory signal from the segment level to frames that likely contain the target object and uses the interactions among objects as a textual guide for the grounding.
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
Stochastic Variational Video Prediction
TLDR
This paper develops a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables, and is the first to provide effective stochastic multi-frame prediction for real-world video.
...