Corpus ID: 189762371

Contrastive Bidirectional Transformer for Temporal Representation Learning

@article{Sun2019ContrastiveBT,
  title={Contrastive Bidirectional Transformer for Temporal Representation Learning},
  author={Chen Sun and Fabien Baradel and Kevin P. Murphy and Cordelia Schmid},
  journal={ArXiv},
  year={2019},
  volume={abs/1906.05743}
}
This paper aims at learning representations for long sequences of continuous signals. Recently, the BERT model has demonstrated the effectiveness of stacked transformers for representing sequences of discrete signals (i.e. word tokens). Inspired by its success, we adopt the stacked transformer architecture, but generalize its training objective to maximize the mutual information between the masked signals and the bidirectional context, via contrastive loss. This enables the model to handle…
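Since the masked inputs here are continuous features rather than discrete tokens, a BERT-style softmax over a fixed vocabulary does not apply; instead the model scores the true masked feature against negatives with a contrastive (noise-contrastive estimation) loss. Below is a minimal PyTorch sketch of such an objective; the function and tensor names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(features, context, mask):
    """InfoNCE between masked continuous features and bidirectional context.

    features: (B, T, D) original, unmasked per-step features (e.g. video features).
    context:  (B, T, D) transformer outputs computed from the masked sequence.
    mask:     (B, T) boolean tensor, True where the input step was masked out.
    """
    targets = features[mask]      # (N, D) ground-truth features at masked steps
    preds = context[mask]         # (N, D) contextual predictions for those steps
    # Score each prediction against every target; the other masked items
    # in the batch serve as negatives for the contrastive loss.
    logits = preds @ targets.t()  # (N, N) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    # Classifying the true feature correctly maximizes a lower bound on the
    # mutual information between the masked signal and its context.
    return F.cross_entropy(logits, labels)

# Toy usage with random tensors standing in for real features.
B, T, D = 2, 16, 32
feats = torch.randn(B, T, D)
ctx = torch.randn(B, T, D)                 # in practice: transformer(masked(feats))
mask = torch.zeros(B, T, dtype=torch.bool)
mask[:, ::4] = True                        # mask every fourth step
loss = masked_contrastive_loss(feats, ctx, mask)
```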
Citations

Composable Augmentation Encoding for Video Representation Learning
TLDR
It is shown that representations learned by the proposed 'augmentation aware' contrastive learning framework encode valuable information about the specified spatial or temporal augmentations, and in doing so also achieve state-of-the-art performance on a number of video benchmarks.
Representation Learning via Global Temporal Alignment and Cycle-Consistency
TLDR
A weakly supervised method for representation learning based on aligning temporal sequences of the same process, using dynamic time warping as a supervisory signal, together with a loss based on scoring the optimal sequence alignment to train an embedding network.
Active Contrastive Learning of Audio-Visual Video Representations
TLDR
An active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items, which improves the quality of negative samples and improves performance on tasks where there is high mutual information in the data, e.g., video classification.
Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning
TLDR
An encoder-decoder backbone using transformer models is designed, and an attention-based cross-modality temporal alignment method that combines the sequential video frames and transcript sentences is developed, which achieves state-of-the-art results on the YouCookII dataset.
Memory-augmented Dense Predictive Coding for Video Representation Learning
TLDR
A new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), is proposed for self-supervised learning from video, in particular for representations for action recognition, trained with a predictive attention mechanism over the set of compressed memories.
Learning Audio-Visual Representations with Active Contrastive Coding
TLDR
This paper proposes an active contrastive coding approach that builds an 'actively sampled' dictionary with diverse and informative items, which improves the quality of negative samples and achieves substantially improved results on tasks where there is high mutual information in the data.
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
TLDR
This paper proposes a Cooperative hierarchical Transformer to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities in real-world video-text tasks.
Representation Learning with Video Deep InfoMax
TLDR
This paper finds that drawing views from both natural-rate sequences and temporally-downsampled sequences yields results on Kinetics-pretrained action recognition tasks which match or outperform prior state-of-the-art methods that use more costly large-time-scale transformer models.
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
TLDR
The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, up to a 5% increase on the representative task, and the effectiveness of the introduced visual concepts is demonstrated.
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0.

References

Showing 1-10 of 41 references
VideoBERT: A Joint Model for Video and Language Representation Learning
TLDR
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
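The "visual tokens" come from vector quantization of continuous video features, so a BERT-style model can treat them like words. A small sketch of that tokenization step, assuming a codebook obtained offline (e.g. by k-means); all names are illustrative:

```python
import torch

def quantize(features, centroids):
    """Map continuous video features to discrete token ids by nearest centroid.

    features:  (N, D) clip-level features from a pretrained video network.
    centroids: (K, D) codebook, e.g. fitted with k-means on training features.
    Returns a (N,) tensor of integer ids forming a 'visual vocabulary'.
    """
    dists = torch.cdist(features, centroids)  # (N, K) pairwise Euclidean distances
    return dists.argmin(dim=1)                # id of the closest codebook entry

visual_tokens = quantize(torch.randn(100, 512), torch.randn(1024, 512))
```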
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
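For contrast with the continuous-signal objective above: with discrete tokens, the masked positions can be predicted with an ordinary softmax over the vocabulary. A rough PyTorch sketch under assumed, scaled-down hyperparameters (not the released BERT code):

```python
import torch
import torch.nn as nn

VOCAB, D, MASK_ID = 30522, 128, 103     # BERT-base vocab size and [MASK] id

embed = nn.Embedding(VOCAB, D)
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(D, VOCAB)           # scores over the whole token vocabulary

tokens = torch.randint(0, VOCAB, (2, 16))            # (B, T) input token ids
mask = torch.zeros_like(tokens, dtype=torch.bool)
mask[:, ::5] = True                                  # positions to predict
corrupted = tokens.masked_fill(mask, MASK_ID)        # replace them with [MASK]

hidden = encoder(embed(corrupted))                   # bidirectional context (B, T, D)
logits = lm_head(hidden)[mask]                       # predictions at masked slots only
loss = nn.functional.cross_entropy(logits, tokens[mask])
```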
Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification
TLDR
This paper presents an approach for learning a visual representation from the raw spatiotemporal signals in videos using a Convolutional Neural Network, and shows that this method captures information that is temporally varying, such as human pose.
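The supervision here is free: a tuple of frames either appears in its true temporal order or it does not, which yields a binary classification target. A minimal sketch of the pair construction (illustrative only; the paper samples frame triplets with motion-based heuristics and encodes them with a ConvNet):

```python
import random

def order_verification_sample(frames, p_positive=0.5):
    """Sample a frame triplet plus a binary temporal-order label.

    frames: a video as a list of frames, assumed to be in temporal order.
    Returns (triplet, label): label 1 if the triplet is correctly ordered.
    """
    i, j, k = sorted(random.sample(range(len(frames)), 3))
    triplet = [frames[i], frames[j], frames[k]]
    if random.random() < p_positive:
        return triplet, 1                 # correctly ordered positive
    triplet[0], triplet[1] = triplet[1], triplet[0]
    return triplet, 0                     # shuffled negative

# A ConvNet embeds each frame; a small head on the concatenated embeddings
# then predicts the "ordered or not" label.
video = [f"frame_{t}" for t in range(30)]  # stand-in for decoded frames
triplet, label = order_verification_sample(video)
```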
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
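The building block is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, which the paper stacks into multi-head layers. A direct sketch:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, the core op of the Transformer.

    q, k, v: (..., T, d_k) query, key, and value tensors.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (..., T, T) similarities
    weights = F.softmax(scores, dim=-1)             # attention distribution per query
    return weights @ v                              # weighted sum of values

q = k = v = torch.randn(2, 8, 64)                   # self-attention over 8 steps
out = scaled_dot_product_attention(q, k, v)         # (2, 8, 64)
```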
End-to-End Dense Video Captioning with Masked Transformer
TLDR
This work proposes an end-to-end transformer model employing a self-attention mechanism, which enables an efficient non-recurrent structure during encoding and leads to performance improvements.
Representation Learning with Contrastive Predictive Coding
TLDR
This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
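CPC's InfoNCE objective trains an autoregressive context vector to identify the true future latent among negatives. A compressed single-step sketch with illustrative shapes (the paper uses a convolutional encoder and a GRU to produce the latents z and contexts c):

```python
import torch
import torch.nn.functional as F

def info_nce(context, future, w_k):
    """One prediction step of CPC's InfoNCE loss.

    context: (B, D) autoregressive summaries c_t (e.g. GRU states).
    future:  (B, D) encoder latents z_{t+k} to be identified.
    w_k:     (D, D) learned bilinear prediction matrix for horizon k.
    """
    pred = context @ w_k               # predicted future latents
    logits = pred @ future.t()         # (B, B): rest of the batch are negatives
    labels = torch.arange(logits.size(0))
    # A low loss means c_t is predictive of z_{t+k}, i.e. high mutual information.
    return F.cross_entropy(logits, labels)

B, D = 8, 64
loss = info_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(D, D))
```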
Two-Stream Convolutional Networks for Action Recognition in Videos
TLDR
This work proposes a two-stream ConvNet architecture which incorporates spatial and temporal networks and demonstrates that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data.
Generating Videos with Scene Dynamics
TLDR
A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed; it can generate tiny videos up to a second long at full frame rate, better than simple baselines.
Long-Term Temporal Convolutions for Action Recognition
TLDR
It is demonstrated that LTC-CNN models with increased temporal extents improve the accuracy of action recognition; the work also studies the impact of different low-level representations, such as raw video pixel values and optical flow vector fields, and shows the importance of high-quality optical flow estimation for learning accurate action models.
SoundNet: Learning Sound Representations from Unlabeled Video
TLDR
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.
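The student-teacher transfer amounts to making the sound network's class distribution match a pretrained visual network's predictions on frames of the same clip, with no labels involved. A sketch of that distillation loss (the temperature and exact form are assumptions):

```python
import torch
import torch.nn.functional as F

def transfer_loss(student_logits, teacher_logits, T=2.0):
    """Match the sound network's output distribution to a visual
    recognition network's predictions on the same unlabeled video.

    Both logits: (B, C). KL divergence between softened distributions;
    the temperature T is an assumption, the paper's exact form may differ.
    """
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

loss = transfer_loss(torch.randn(4, 1000), torch.randn(4, 1000))
```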