Boosting Video-Text Retrieval with Explicit High-Level Semantics

  title={Boosting Video-Text Retrieval with Explicit High-Level Semantics},
  author={Haoran Wang and Di Xu and Dongliang He and Fu Li and Zhong Ji and Jungong Han and Errui Ding},
  journal={Proceedings of the 30th ACM International Conference on Multimedia},
  year={2022},

  • Published 8 August 2022
Video-text retrieval (VTR) is an attractive yet challenging task in multi-modal understanding, which aims to retrieve the relevant video (text) given a text (video) query. Existing methods typically employ completely heterogeneous visual-textual information to align video and text, while lacking awareness of the homogeneous high-level semantic information residing in both modalities. To fill this gap, in this work we propose a novel visual-linguistic alignment model named HiSE for VTR, which…



HANet: Hierarchical Alignment Networks for Video-Text Retrieval

The proposed Hierarchical Alignment Network (HANet), which aligns representations at different levels for video-text matching, outperforms other state-of-the-art methods, demonstrating the effectiveness of hierarchical representation and alignment.

Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning

A Hierarchical Graph Reasoning (HGR) model is proposed, which decomposes video-text matching into global-to-local levels and generates hierarchical textual embeddings via attention-based graph reasoning.

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

Hierarchical Transformer (HiT) is proposed, which performs hierarchical cross-modal contrastive matching at both the feature level and the semantic level, achieving multi-view and comprehensive retrieval results; its momentum-updated encoders are inspired by MoCo.
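The MoCo-style momentum contrast that HiT builds on pairs a slowly-updated "key" encoder with a queue of negatives and an InfoNCE loss. A minimal numpy sketch of those two ingredients, with illustrative function names (the actual HiT encoders are transformers, not shown here):

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """EMA update of the key encoder's parameters from the query encoder (MoCo-style)."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

def info_nce(q, k_pos, queue, tau=0.07):
    """InfoNCE loss for one query: one positive key vs. a queue of negatives.
    q, k_pos: (d,) L2-normalized embeddings; queue: (K, d) negative keys."""
    logits = np.concatenate([[q @ k_pos], queue @ q]) / tau
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # positive sits at index 0
```

The momentum coefficient `m` close to 1 keeps the key encoder (and hence the queued negatives) consistent across iterations, which is what makes the large negative queue usable.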

Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

This paper proposes a novel framework that simultaneously utilizes multi-modal features (different visual characteristics, audio inputs, and text) via a fusion strategy for efficient retrieval, and explores several loss functions for training the embedding.
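A loss commonly explored for such joint video-text embeddings is the bidirectional max-margin ranking (triplet) loss over in-batch similarities. A minimal numpy sketch under that assumption (the paper's exact loss variants are not reproduced here):

```python
import numpy as np

def hinge_rank_loss(sim, margin=0.2):
    """Bidirectional max-margin ranking loss over a similarity matrix.
    sim[i, j] = similarity of video i and text j; the diagonal holds matched pairs."""
    n = sim.shape[0]
    pos = np.diag(sim)                                   # matched video-text scores
    cost_t = np.maximum(0, margin + sim - pos[:, None])  # mismatched texts per video
    cost_v = np.maximum(0, margin + sim - pos[None, :])  # mismatched videos per text
    mask = 1.0 - np.eye(n)                               # exclude the positives themselves
    return ((cost_t + cost_v) * mask).sum() / n
```

The loss is zero once every matched pair outscores all in-batch mismatches by at least `margin`, in both retrieval directions.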

Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval

A novel memory-enhanced embedding learning (MEEL) method for video-text retrieval that fuses the multiple texts corresponding to a video during training, so as to prevent the embeddings in the memory bank from evolving too quickly.

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data

This work proposes a Mixture-of-Embedding-Experts (MEE) model able to handle missing input modalities during training, and demonstrates significant improvements over previously reported methods on both text-to-video and video-to-text retrieval tasks.

SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries

As extensive experiments on four benchmarks show, SEA surpasses the state of the art and is extremely easy to implement, making it an appealing solution for AVS and promising for continuously advancing the task by harvesting new sentence encoders.

Visual Consensus Modeling for Video-Text Retrieval

This paper makes the first attempt to model the visual consensus by mining the visual concepts from videos and exploiting their co-occurrence patterns within the video and text modalities with no reliance on any additional concept annotations.
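Mining co-occurrence patterns of detected visual concepts, as described above, can be sketched as building a concept co-occurrence matrix over the corpus. A minimal illustration assuming binary per-video concept detections (the names and representation are illustrative, not the paper's implementation):

```python
import numpy as np

def concept_cooccurrence(concept_hits):
    """Co-occurrence counts of visual concepts across a corpus.
    concept_hits: (num_videos, num_concepts) binary matrix of detected concepts."""
    X = np.asarray(concept_hits, dtype=float)
    co = X.T @ X              # co[i, j] = number of videos where concepts i and j co-occur
    np.fill_diagonal(co, 0)   # drop trivial self co-occurrence
    return co
```

Row-normalizing such a matrix yields empirical concept-transition statistics that can guide cross-modal alignment without extra concept annotations.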

Multi-modal Transformer for Video Retrieval

A multi-modal transformer jointly encodes the different modalities in video, allowing each of them to attend to the others, within a novel framework that establishes state-of-the-art results for video retrieval on three datasets.

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

This work focuses on video-language tasks including multimodal retrieval and video QA, evaluating the JSFusion model on three retrieval and VQA tasks in LSMDC, where it achieves the best performance reported so far.