Corpus ID: 237454570

Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

@article{Cheng2021ImprovingVR,
  title={Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss},
  author={Xingyi Cheng and Hezheng Lin and Xiangyu Wu and F. Yang and Dong Shen},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.04290}
}
Employing the large-scale pre-trained model CLIP for the video-text retrieval task (VTR) has become a new trend that surpasses previous VTR methods. However, due to the heterogeneity of structure and content between video and text, previous CLIP-based models are prone to overfitting during training, resulting in relatively poor retrieval performance. In this paper, we propose a multi-stream Corpus Alignment network with single-gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss…
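
The abstract is truncated before the Dual Softmax Loss is defined. As a rough illustration, the sketch below shows the general dual-softmax idea commonly used with CLIP-style retrieval: the logits for each retrieval direction are revised by the softmax prior computed along the opposite direction before the standard cross-entropy is applied. The tensor shapes, logit_scale value, and exact form of the revision are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def dual_softmax_loss(sim: torch.Tensor, logit_scale: float = 100.0) -> torch.Tensor:
    """Sketch of a dual-softmax retrieval loss (assumed formulation).

    sim is a (B, B) matrix of cosine similarities between B texts (rows)
    and B videos (columns); matched pairs lie on the diagonal.
    """
    logits = sim * logit_scale
    prior_over_texts = F.softmax(logits, dim=0)   # per video: how strongly each text matches it
    prior_over_videos = F.softmax(logits, dim=1)  # per text: how strongly each video matches it
    labels = torch.arange(sim.size(0), device=sim.device)
    # Text-to-video direction, with logits revised by the opposite-direction prior.
    loss_t2v = F.cross_entropy(logits * prior_over_texts, labels)
    # Video-to-text direction, likewise revised, then transposed for row-wise targets.
    loss_v2t = F.cross_entropy((logits * prior_over_videos).t(), labels)
    return 0.5 * (loss_t2v + loss_v2t)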

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

TLDR
A novel token shift and selection transformer architecture that dynamically adjusts the token sequence and selects informative tokens in both the temporal and spatial dimensions of input video samples, achieving state-of-the-art performance on major text-video retrieval benchmarks.
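
The shift operation itself is not spelled out in the TLDR. Below is a minimal sketch of the general token-shift idea, i.e., TSM-style exchange of a fraction of channels between tokens of adjacent frames, with an assumed tensor layout and shift ratio; it illustrates the mechanism family rather than the exact TS2-Net module.

import torch

def temporal_token_shift(tokens: torch.Tensor, shift_ratio: float = 0.25) -> torch.Tensor:
    """Exchange a fraction of channels between tokens of adjacent frames.

    tokens: (batch, num_frames, tokens_per_frame, dim) -- assumed layout.
    """
    d = tokens.size(-1)
    fold = int(d * shift_ratio) // 2
    shifted = torch.zeros_like(tokens)
    shifted[:, 1:, :, :fold] = tokens[:, :-1, :, :fold]                   # channels shifted forward in time
    shifted[:, :-1, :, fold:2 * fold] = tokens[:, 1:, :, fold:2 * fold]   # channels shifted backward in time
    shifted[:, :, :, 2 * fold:] = tokens[:, :, :, 2 * fold:]              # remaining channels stay in place
    return shifted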

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

TLDR
This paper proposes an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP, and shows that this approach improves the performance of CLIP on video-text retrieval by a large margin.

Improving video retrieval using multilingual knowledge transfer

TLDR
This paper proposes a framework, MKTVR, that utilizes knowledge transfer from a multilingual model to boost the performance of video retrieval, and achieves state-of-the-art results on all datasets, outperforming previous models.

CLIP2TV: An Empirical Study on Transformer-based Methods for Video-Text Retrieval

TLDR
This report presents CLIP2TV, which aims to explore where the critical elements lie in transformer-based methods: it revisits some recent works on multi-modal learning, introduces several techniques into video-text retrieval, and evaluates them through extensive experiments in different configurations.

LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval

TLDR
This work presents a novel mechanism for learning the translation relationship from a source modality space S to a target modality space T without the need for a joint latent space, which bridges the gap between the visual and textual domains.

Multi-query Video Retrieval

TLDR
This paper shows that the multi-query retrieval task effectively mitigates the dataset noise introduced by imperfect annotations and better correlates with human judgement when evaluating the retrieval abilities of current models, and investigates several methods that leverage multiple queries at training time.
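
How the multiple queries are combined is not stated in the TLDR. A simple illustrative baseline is to fuse the per-query similarity scores (for example, by averaging) before ranking the gallery; the sketch below shows that fusion with assumed L2-normalised embeddings, and is not one of the specific methods studied in the paper.

import torch

def multi_query_scores(text_embs: torch.Tensor, video_embs: torch.Tensor) -> torch.Tensor:
    """Rank videos against a set of queries describing the same video.

    text_embs:  (num_queries, dim)  -- L2-normalised query embeddings (assumed)
    video_embs: (num_videos, dim)   -- L2-normalised video embeddings (assumed)
    Returns one fused score per video by averaging per-query similarities.
    """
    per_query = text_embs @ video_embs.t()   # (num_queries, num_videos)
    return per_query.mean(dim=0)             # simple late fusion over queries

# Usage: rank the gallery by the fused score.
# scores = multi_query_scores(queries, gallery)
# top5 = scores.topk(5).indices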

M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval

TLDR
A multi-level, multi-modal hybrid fusion (M2HF) network that explores comprehensive interactions between text queries and each modality's content in videos, with two kinds of training strategies: an ensemble manner and an end-to-end manner.

Learning Audio-Video Modalities from Image Captions

TLDR
A new video mining pipeline is proposed which involves transferring captions from image captioning datasets to video clips with no additional manual effort, and it is shown that training a multimodal transformer-based model on this data achieves competitive performance on video retrieval and video captioning.

FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks

TLDR
This paper presents a fine-tuning strategy to refine these large-scale pretrained image-text models for zero-shot video understanding tasks and shows that by carefully adapting these models, considerable improvements are obtained on two zero-shot action recognition tasks and three text-to-video retrieval tasks.

Frozen CLIP Models are Efficient Video Learners

TLDR
This paper presents Efficient Video Learning (EVL), an efficient framework for directly training high-quality video recognition models on frozen CLIP features, which adopts a local temporal module in each decoder layer to discover temporal clues from adjacent frames and their attention maps.

References

Showing 1-10 of 45 references

CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

TLDR
This work leverages a pretrained image-language model, simplifies it into a two-stage framework with co-learning of image-text and enhancement of temporal relations between video frames and video-text respectively, and makes it possible to train on comparatively small datasets.

HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval

TLDR
Hierarchical Transformer (HiT) is proposed, which performs Hierarchical Cross-modal Contrastive Matching at both the feature level and the semantic level, achieving multi-view and comprehensive retrieval results; the design is inspired by MoCo.

Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval

TLDR
This paper proposes a novel framework that simultaneously utilizes multi-modal features (different visual characteristics, audio inputs, and text) by a fusion strategy for efficient retrieval and explores several loss functions in training the embedding.

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval

TLDR
An efficient global-local alignment method is proposed to provide a global cross-modal measurement that is complementary to the local perspective; it enables meticulous local comparison and reduces the computational cost of the interaction between each text-video pair.

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

TLDR
An end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets and yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.

Multi-modal Transformer for Video Retrieval

TLDR
A multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others, and a novel framework to establish state-of-the-art results for video retrieval on three datasets.

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

TLDR
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
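
Soft-attention pooling over frame-level features is a standard mechanism; the minimal sketch below (with an assumed learned scoring layer and tensor shapes, not the exact model evaluated in the paper) shows how per-frame representations can be weighted and summed into a single video descriptor.

import torch
import torch.nn as nn

class SoftAttentionPooling(nn.Module):
    """Pool a sequence of frame features into one video feature with
    learned soft-attention weights (assumed minimal variant)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one relevance score per frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim)
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, num_frames, 1)
        return (weights * frames).sum(dim=1)                # (batch, dim)

# Usage: pooled = SoftAttentionPooling(512)(torch.randn(2, 20, 512))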

Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval

TLDR
A Hierarchical Cross-Modal Graph Consistency Learning Network (HCGC) for the video-text retrieval task, which considers multi-level graph consistency for video-text matching and designs three types of graph consistency: inter-graph parallel consistency, inter-graph cross consistency, and intra-graph cross consistency.

A Joint Sequence Fusion Model for Video Question Answering and Retrieval

TLDR
This work focuses on video-language tasks including multimodal retrieval and video QA, and evaluates the JSFusion model on three retrieval and VQA tasks in LSMDC, for which the model achieves the best performance reported so far.