Multi-modal Transformer for Video Retrieval

@inproceedings{Gabeur2020MultimodalTF,
  title={Multi-modal Transformer for Video Retrieval},
  author={Valentin Gabeur and Chen Sun and Alahari Karteek and Cordelia Schmid},
  booktitle={ECCV},
  year={2020}
}
The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them… 
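For illustration only, below is a minimal PyTorch sketch of the kind of fusion the abstract describes: per-modality expert features are projected into a shared space, tagged with a learned modality embedding, and passed through a shared transformer encoder so that every modality can attend to the others. The modality names, dimensions, and mean-pooled aggregation are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Sketch of cross-modal fusion with a shared transformer encoder."""
    def __init__(self, dims, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        # Project each modality's pre-extracted expert features into a shared space.
        self.proj = nn.ModuleDict({m: nn.Linear(d, d_model) for m, d in dims.items()})
        # Learned embedding marking which modality a token comes from.
        self.modality_emb = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(d_model)) for m in dims})
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, feats):
        # feats: {modality: (batch, n_tokens, dim)} per-frame or per-segment features.
        tokens = [self.proj[m](x) + self.modality_emb[m] for m, x in feats.items()]
        tokens = torch.cat(tokens, dim=1)   # one token sequence across all modalities
        fused = self.encoder(tokens)        # self-attention lets modalities attend to each other
        return fused.mean(dim=1)            # pool into a single video embedding

# Usage with random tensors standing in for expert features
# ("rgb", "audio", "ocr" are placeholder modality names).
model = MultiModalFusion({"rgb": 2048, "audio": 128, "ocr": 300})
feats = {"rgb": torch.randn(2, 16, 2048),
         "audio": torch.randn(2, 16, 128),
         "ocr": torch.randn(2, 8, 300)}
video_emb = model(feats)   # shape (2, 512)
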
Everything at Once - Multi-modal Fusion Transformer for Video Retrieval
TLDR
This work presents a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a fused representation in a joint multi-modal embedding space.
CRET: Cross-Modal Retrieval Transformer for Efficient Text-Video Retrieval
TLDR
To balance the information loss and computational overhead when sampling frames from a given video, a novel GEES loss is presented, which implicitly conducts dense sampling in the video embedding space, without suffering from heavy computational cost.
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
TLDR
Hierarchical Transformer (HiT) is proposed, which performs Hierarchical Cross-modal Contrastive Matching at both the feature level and the semantic level to achieve multi-view and comprehensive retrieval results; the approach is inspired by MoCo.
Semantic Role Aware Correlation Transformer For Text To Video Retrieval
TLDR
A novel transformer is proposed that explicitly disentangles text and video into the semantic roles of objects, spatial contexts, and temporal contexts, with an attention scheme that learns the intra- and inter-role correlations among the three roles to discover discriminative features for matching at different levels.
Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment
TLDR
This paper leverages hierarchical video-text alignment to fully explore the diverse characteristics of multi-modal cues, enabling fine-grained alignment with local semantics from phrases as well as capturing high-level semantic correspondence.
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval
TLDR
A novel mixture-of-expert transformer, RoME, is proposed that disentangles the text and the video into three levels: the roles of spatial contexts, temporal contexts, and object contexts, to fully exploit visual and text embeddings at both global and local levels.
Object-aware Video-language Pre-training for Retrieval
TLDR
Object-aware Transformers is presented, an object-centric approach that extends the video-language transformer to incorporate object representations, leveraging bounding boxes and object tags to guide the training process.
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
TLDR
This paper proposes a Cooperative Hierarchical Transformer to leverage hierarchy information and to model the interactions between different levels of granularity and different modalities in real-world video-text tasks.
Domain Adaptation in Multi-View Embedding for Cross-Modal Video Retrieval
TLDR
This paper proposes a novel iterative domain alignment method by means of pseudo-labelling target videos and cross-domain (i.e. source-target) ranking, which adapts the embedding space to the target gallery, consistently outperforming source-only as well as marginal and conditional alignment methods.
Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions
TLDR
A novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) is presented for many visual tasks, achieving new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks.
...

References

Showing 1-10 of 35 references
CVPR 2020 Video Pentathlon Challenge: Multi-modal Transformer for Video Retrieval
TLDR
A framework based on a multi-modal transformer architecture is presented, which jointly encodes the different modalities in video and allows them to attend to each other; this approach achieved the top result in the CVPR 2020 Video Pentathlon Challenge.
Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
TLDR
This paper proposes a novel framework that simultaneously utilizes multi-modal features (different visual characteristics, audio inputs, and text) through a fusion strategy for efficient retrieval, and explores several loss functions for training the embedding.
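As a concrete, hedged illustration of the kind of loss functions explored in this line of work, below is a bidirectional max-margin ranking loss commonly used to train joint video-text embeddings; the margin value and the use of cosine similarity are assumptions, not necessarily this paper's exact choices.

import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(video_emb, text_emb, margin=0.2):
    # video_emb, text_emb: (batch, dim); matching pairs share the same batch index.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.t()                   # (batch, batch) cosine similarity matrix
    pos = sim.diag().unsqueeze(1)     # similarity of each ground-truth pair
    # Hinge on negatives in both directions: wrong captions for each video (rows)
    # and wrong videos for each caption (columns).
    cost_rows = (margin + sim - pos).clamp(min=0)
    cost_cols = (margin + sim - pos.t()).clamp(min=0)
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)   # ignore the positives themselves
    return ((cost_rows + cost_cols) * mask).sum() / sim.size(0)
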
Joint embeddings with multimodal cues for video-text retrieval
TLDR
This paper proposes a framework that simultaneously utilizes multimodal visual cues via a “mixture of experts” approach for retrieval, and conducts extensive experiments verifying that the system boosts retrieval performance compared to the state of the art.
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
TLDR
This work proposes a Mixture-of-Embedding-Experts (MEE) model with the ability to handle missing input modalities during training, and demonstrates significant improvements, outperforming previously reported methods on both text-to-video and video-to-text retrieval tasks.
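A rough sketch of the mixture-of-embedding-experts idea described above, assuming one embedding per modality and caption-conditioned mixture weights: per-expert similarities are mixed with weights that are renormalised over the modalities actually present, which is one way missing inputs can be handled. Variable names and shapes are illustrative assumptions.

import torch

def mee_similarity(text_embs, video_embs, weight_logits, available):
    # text_embs, video_embs: {modality: (batch, dim)} per-expert embeddings for matched pairs
    # weight_logits: (batch, n_modalities) caption-conditioned mixture logits
    # available: (batch, n_modalities) 1.0 if the video provides that modality, else 0.0,
    #            with columns ordered as sorted(text_embs)
    mods = sorted(text_embs)
    sims = torch.stack([(text_embs[m] * video_embs[m]).sum(-1) for m in mods], dim=-1)
    w = torch.softmax(weight_logits, dim=-1) * available
    w = w / w.sum(dim=-1, keepdim=True).clamp(min=1e-6)   # renormalise over present experts
    return (w * sims).sum(dim=-1)   # (batch,) mixed video-text similarity
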
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
TLDR
This paper proposes to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions, building a separate multi-modal embedding space for each PoS tag; this enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
TLDR
This work focuses on video-language tasks including multimodal retrieval and video QA, and evaluates the JSFusion model in three retrieval and VQA tasks in LSMDC, for which the model achieves the best performance reported so far.
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
TLDR
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
Use What You Have: Video retrieval using representations from collaborative experts
TLDR
This paper proposes a collaborative experts model to aggregate information from different pre-trained experts, and assesses the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.
Cross-Modal and Hierarchical Modeling of Video and Text
TLDR
Hierarchical sequence embedding is introduced, a generic model for embedding sequential data of different modalities into hierarchically semantic spaces with either explicit or implicit correspondence information, targeting hierarchical sequential data where there are correspondences across multiple modalities.
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
TLDR
This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos, and outperforms all published self-supervised approaches on these tasks as well as several fully supervised baselines.
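For reference, a hedged sketch of an MIL-NCE-style objective: each clip is paired with several candidate positive captions (e.g. temporally neighbouring narrations), and the summed positive scores are contrasted against all caption scores in the batch. The temperature and the shapes below are assumptions rather than the paper's exact formulation.

import torch

def mil_nce_loss(video_emb, text_emb, temperature=0.07):
    # video_emb: (batch, dim); text_emb: (batch, n_pos, dim) candidate positive captions per clip
    b, k, d = text_emb.shape
    sim = video_emb @ text_emb.reshape(b * k, d).t() / temperature   # (batch, batch * n_pos)
    sim = sim.view(b, b, k)   # sim[i, j, c]: clip i vs. caption c of clip j
    pos = torch.logsumexp(sim[torch.arange(b), torch.arange(b)], dim=-1)   # a clip's own captions
    denom = torch.logsumexp(sim.reshape(b, -1), dim=-1)                    # all captions in the batch
    return (denom - pos).mean()   # average of -log(positive score mass / total score mass)
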
...