Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning

@article{Chen2020FineGrainedVR,
  title={Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning},
  author={Shizhe Chen and Yida Zhao and Qin Jin and Qi Wu},
  journal={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020},
  pages={10635-10644}
}
Cross-modal retrieval between videos and texts has attracted growing attention due to the rapid emergence of videos on the web. The current dominant approach is to learn a joint embedding space to measure cross-modal similarities. However, simple embeddings are insufficient to represent complicated visual and textual details, such as scenes, objects, actions and their compositions. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning (HGR) model, which… 
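The joint-embedding approach mentioned in the abstract can be summarised with a short sketch. The code below is a minimal, illustrative example, not the HGR model itself: two linear projections map precomputed video and text features into a shared space, trained with a hinge-based bidirectional triplet ranking loss over in-batch negatives. The encoder dimensions, margin value, and feature sources are assumptions made for illustration.

```python
# Minimal sketch of joint-embedding training with a bidirectional
# hinge-based triplet ranking loss (the "dominant approach" the abstract
# refers to). Dimensions and hyper-parameters are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video branch
        self.text_proj = nn.Linear(text_dim, embed_dim)    # text branch

    def forward(self, video_feats, text_feats):
        # L2-normalise so the dot product equals cosine similarity
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def triplet_ranking_loss(v, t, margin=0.2):
    """Bidirectional max-margin loss over in-batch negatives."""
    sim = v @ t.t()                      # (B, B) similarity matrix
    pos = sim.diag().unsqueeze(1)        # matched pairs on the diagonal
    cost_v2t = (margin + sim - pos).clamp(min=0)       # video -> text
    cost_t2v = (margin + sim - pos.t()).clamp(min=0)   # text -> video
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_v2t = cost_v2t.masked_fill(mask, 0)
    cost_t2v = cost_t2v.masked_fill(mask, 0)
    return cost_v2t.mean() + cost_t2v.mean()

# usage with random stand-ins for pooled video/text features
model = JointEmbedding()
video_feats = torch.randn(32, 2048)
text_feats = torch.randn(32, 768)
v, t = model(video_feats, text_feats)
loss = triplet_ranking_loss(v, t)
loss.backward()
```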
Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval
TLDR
A Hierarchical Cross-Modal Graph Consistency Learning Network (HCGC) is proposed for the video-text retrieval task, which considers multi-level graph consistency for video-text matching and designs three types of graph consistency: inter-graph parallel consistency, inter-graph cross consistency, and intra-graph cross consistency.
HANet: Hierarchical Alignment Networks for Video-Text Retrieval
TLDR
The proposed Hierarchical Alignment Network (HANet), which aligns different-level representations for video-text matching, outperforms other state-of-the-art methods, demonstrating the effectiveness of hierarchical representation and alignment.
Multi-Feature Graph Attention Network for Cross-Modal Video-Text Retrieval
TLDR
A novel Multi-Feature Graph Attention Network (MFGATN) is proposed, which enriches the representation of each feature in videos through the interchange of high-level semantic information among them, together with a novel Dual Constraint Ranking Loss (DCRL) that simultaneously considers the inter-modal ranking constraint and the intra-modal structure constraint; a hedged sketch of such a loss follows below.
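The DCRL summary names two constraints; one plausible formulation is sketched below. This is an assumption for illustration, not the paper's actual loss: an inter-modal max-margin ranking term plus an intra-modal structure term that asks the video-video and text-text similarity matrices to agree. The margin and weighting factor are placeholders.

```python
# Hedged sketch of a loss combining an inter-modal ranking constraint with an
# intra-modal structure constraint, in the spirit of the DCRL summary above.
# The exact formulation here is an illustrative assumption.
import torch
import torch.nn.functional as F

def dual_constraint_loss(v, t, margin=0.2, alpha=0.5):
    v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
    sim = v @ t.t()                          # cross-modal similarities
    pos = sim.diag().unsqueeze(1)            # matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # inter-modal ranking: matched pairs beat in-batch negatives by a margin
    inter = ((margin + sim - pos).clamp(min=0).masked_fill(mask, 0).mean()
             + (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0).mean())
    # intra-modal structure: neighbourhood structure of the two modalities agrees
    intra = F.mse_loss(v @ v.t(), t @ t.t())
    return inter + alpha * intra
```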
Fine-grained Cross-modal Alignment Network for Text-Video Retrieval
TLDR
A novel Fine-grained Cross-modal Alignment Network (FCA-Net) is proposed, which considers the interactions between visual semantic units (i.e., sub-actions/sub-events) in videos and phrases in sentences for cross-modal alignment.
Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval
TLDR
A multi-modal relational graph is proposed to capture the interactions among objects from the visual and textual content and to identify the differences among similar video moment candidates, introducing a visual relational graph and a textual relational graph to form relation-aware representations via message propagation.
Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval
TLDR
A novel cross-modal retrieval framework that considers the spatio-temporal visual relations among components to enhance the global video representation in bridging text-video modalities, with the relations encoded by a multi-layer spatio-temporal transformer to learn visual relational features.
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
TLDR
This paper proposes a Cooperative hierarchical Transformer to leverage hierarchy information and model the interactions between different levels of granularity and different modalities in real-world video-text tasks.
Progressive Semantic Matching for Video-Text Retrieval
TLDR
This work aims at narrowing the semantic gap through a progressive learning process with a coarse-to-fine architecture, and proposes a novel Progressive Semantic Matching (PSM) method, which achieves significant performance improvement compared with state-of-the-art approaches.
HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval
TLDR
Hierarchical Transformer (HiT) is proposed, which performs hierarchical cross-modal contrastive matching at the feature level and the semantic level to achieve multi-view and comprehensive retrieval results, and, inspired by MoCo, introduces Momentum Cross-Modal Contrast to enable large-scale negative interactions on-the-fly.
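The MoCo-style momentum contrast mentioned for HiT can be sketched as a slowly updated key encoder plus an InfoNCE loss whose negatives come from a feature queue. The momentum coefficient, temperature, queue handling, and encoder interfaces below are illustrative assumptions, not the paper's configuration.

```python
# Rough sketch of MoCo-style momentum contrast adapted to two modalities,
# as HiT's summary suggests. All hyper-parameters are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(query_encoder, key_encoder, m=0.999):
    # the key (momentum) encoder trails the query encoder as a slow-moving average
    for q_p, k_p in zip(query_encoder.parameters(), key_encoder.parameters()):
        k_p.data.mul_(m).add_(q_p.data, alpha=1.0 - m)

def momentum_contrast_loss(query, keys, queue, temperature=0.07):
    """InfoNCE: positives are the aligned keys, negatives come from the queue.

    query: (B, D) features from one modality, keys: (B, D) aligned features
    from the other modality, queue: (K, D) previously enqueued (normalised) keys.
    """
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    l_pos = (query * keys).sum(dim=-1, keepdim=True)   # (B, 1) positive logits
    l_neg = query @ queue.t()                          # (B, K) queued negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, labels)
```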
Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment
TLDR
This paper leverages the hierarchical video-text alignment to fully explore the detailed diverse characteristics in multi-modal cues for fine-grained alignment with local semantics from phrases, as well as to capture a high-level semantic correspondence.

References

(showing 1-10 of 47 references)
Cross-Modal and Hierarchical Modeling of Video and Text
TLDR
Hierarchical sequence embedding is introduced, a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, using either explicit or implicit correspondence information across multiple modalities.
Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings
TLDR
This paper proposes to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions by building a separate multi-modal embedding space for each PoS tag, which enables learning specialised embedding spaces that offer multiple views of the same embedded entities.
Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval
TLDR
This paper proposes a novel framework that simultaneously utilizes multi-modal features (different visual characteristics, audio inputs, and text) by a fusion strategy for efficient retrieval and explores several loss functions in training the embedding.
Visual Semantic Reasoning for Image-Text Matching
TLDR
A simple and interpretable reasoning model is presented to generate a visual representation that captures key objects and semantic concepts of a scene, outperforming the current best method for image retrieval and caption retrieval on the MS-COCO and Flickr30K datasets.
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
TLDR
This work proposes to incorporate generative processes into the cross-modal feature embedding, through which it is able to learn not only the global abstract features but also the local grounded features of image-text pairs.
A Joint Sequence Fusion Model for Video Question Answering and Retrieval
TLDR
This work focuses on video-language tasks including multimodal retrieval and video QA, and evaluates the JSFusion model in three retrieval and VQA tasks in LSMDC, for which the model achieves the best performance reported so far.
Stacked Cross Attention for Image-Text Matching
TLDR
Stacked Cross Attention discovers the full latent alignments using both image regions and words in a sentence as context to infer image-text similarity, achieving state-of-the-art results on the MS-COCO and Flickr30K datasets.
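A minimal sketch of the text-to-image direction of stacked cross attention follows: each word attends over image regions, and the image-sentence similarity aggregates word-level cosine similarities to the attended region vectors. The smoothing factor and the mean aggregation are simplifying assumptions rather than the paper's exact pooling choices.

```python
# Minimal sketch of text-to-image stacked cross attention: words attend over
# image regions, then word-level similarities are aggregated into one score.
import torch
import torch.nn.functional as F

def stacked_cross_attention(regions, words, smooth=9.0):
    """regions: (R, D) image region features; words: (T, D) word features."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    attn = words @ regions.t()                # (T, R) word-region affinities
    attn = F.softmax(smooth * attn, dim=-1)   # each word attends over regions
    attended = attn @ regions                 # (T, D) attended image vector per word
    word_sim = F.cosine_similarity(words, attended, dim=-1)  # (T,)
    return word_sim.mean()                    # image-sentence similarity

# usage with random stand-ins for detected regions and word embeddings
score = stacked_cross_attention(torch.randn(36, 512), torch.randn(12, 512))
```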
Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations
TLDR
The Unified VSE outperforms baselines on cross-modal retrieval tasks; enforcing semantic coverage improves the model's robustness in defending against text-domain adversarial attacks and empowers the use of visual cues to accurately resolve word dependencies in novel sentences.
Predicting Visual Features From Text for Image and Video Caption Retrieval
TLDR
This paper contributes Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input, and generalizes Word2VisualVec for video caption retrieval by predicting from text both three-dimensional convolutional neural network features and a visual-audio representation.
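The Word2VisualVec idea, predicting a visual feature from text, reduces to a small regression network. The sketch below assumes a pooled text vector and a mean-squared-error objective against the paired visual feature; layer sizes and the pooling choice are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch of the Word2VisualVec idea: a feed-forward network maps a pooled
# text representation into the visual feature space, trained with MSE
# against the paired video/image feature. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Text2VisualNet(nn.Module):
    def __init__(self, text_dim=500, hidden_dim=1000, visual_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, visual_dim),
        )

    def forward(self, text_vec):
        return self.net(text_vec)   # predicted visual feature

# usage with random stand-ins for pooled caption vectors and CNN features
model = Text2VisualNet()
text_vec = torch.randn(16, 500)
visual_feat = torch.randn(16, 2048)
loss = F.mse_loss(model(text_vec), visual_feat)
loss.backward()
```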
Use What You Have: Video retrieval using representations from collaborative experts
TLDR
This paper proposes a collaborative experts model to aggregate information from different pre-trained experts and assesses the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.