Visual Semantic Reasoning for Image-Text Matching

@article{Li2019VisualSR,
  title={Visual Semantic Reasoning for Image-Text Matching},
  author={Kunpeng Li and Yulun Zhang and K. Li and Yuanyuan Li and Yun Raymond Fu},
  journal={2019 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2019},
  pages={4653-4661}
}
Image-text matching has been a hot research topic bridging the vision and language areas. It remains challenging because the current image representation usually lacks the global semantic concepts present in its corresponding text caption. To address this issue, we propose a simple and interpretable reasoning model to generate a visual representation that captures the key objects and semantic concepts of a scene. Specifically, we first build up connections between image regions and perform reasoning with…
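As a rough illustration of the region-reasoning idea in the (truncated) abstract above, here is a minimal PyTorch sketch, not the authors' code: it forms a fully-connected graph over detected region features, applies one graph-convolution-style update, and pools the result into a single visual embedding. The class name, feature sizes, and the mean-pooling step are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionReasoning(nn.Module):
    """Hypothetical sketch of reasoning over connected image regions."""
    def __init__(self, dim=2048, embed_dim=1024):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # pairwise affinities define graph edges
        self.key = nn.Linear(dim, dim)
        self.gcn = nn.Linear(dim, dim)     # one graph-convolution step
        self.project = nn.Linear(dim, embed_dim)

    def forward(self, regions):            # regions: (batch, n_regions, dim)
        # Edge weights: softmax-normalized affinity between every pair of regions.
        q, k = self.query(regions), self.key(regions)
        adj = F.softmax(torch.bmm(q, k.transpose(1, 2)), dim=-1)
        # Propagate features over the graph, with a residual connection.
        reasoned = regions + F.relu(self.gcn(torch.bmm(adj, regions)))
        # Mean-pool regions into one global visual embedding (a simplification;
        # the paper's model continues with further reasoning steps).
        return self.project(reasoned.mean(dim=1))

# Example: 36 region features of size 2048 for a batch of 2 images.
print(RegionReasoning()(torch.randn(2, 36, 2048)).shape)  # torch.Size([2, 1024])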
Citations

Exploiting Visual Semantic Reasoning for Video-Text Retrieval
TLDR
This work proposes a Visual Semantic Enhanced Reasoning Network (ViSERN), which treats frame regions as vertices and constructs a fully-connected semantic correlation graph to perform reasoning between frame regions.
Learning Hierarchical Visual-Semantic Representation with Phrase Alignment
  • Baoming Yan, Qingheng Zhang, +6 authors Binqiang Zhao
  • Computer Science
  • ICMR
  • 2021
TLDR
This work proposes a Hierarchical Visual-Semantic Network with fine-grained semantic alignment to exploit the hierarchical structure in both images and text, achieving state-of-the-art performance on the Flickr30K and MS-COCO datasets.
Dual Semantic Relationship Attention Network for Image-Text Matching
  • Keyu Wen, Xiaodong Gu
  • Computer Science
  • 2020 International Joint Conference on Neural Networks (IJCNN)
  • 2020
TLDR
A novel Dual Semantic Relationship Attention Network is proposed, consisting mainly of two modules, a separate semantic relationship module and a joint semantic relationship module, which together promote the image-text matching process.
Learning Dual Semantic Relations With Graph Attention for Image-Text Matching
TLDR
A novel multi-level semantic relation enhancement approach named Dual Semantic Relations Attention Network (DSRAN) is proposed, consisting mainly of two modules, a separate semantic relations module and a joint semantic relations module, which perform region-level and regional-global relation enhancement at the same time.
Similarity Reasoning and Filtration for Image-Text Matching
TLDR
A novel Similarity Graph Reasoning and Attention Filtration network for image-text matching using vector-based similarity representations is proposed, and its superiority is demonstrated by state-of-the-art performance on the Flickr30K and MS-COCO datasets.
Multi-level similarity learning for image-text retrieval
TLDR
A multi-level representation learning approach for the image-text retrieval task is proposed, which utilizes semantic-level, structural-level, and contextual information to improve the quality of visual and textual representations.
Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval
TLDR
This paper employs a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
Cross-modal multi-relationship aware reasoning for image-text matching
TLDR
A new method, CMRN, is proposed to extract multiple relationships and learn the correlations between image regions, covering two kinds of visual relations, geometric position and semantic interaction; experiments show that CMRN achieves superior performance compared with state-of-the-art methods.
Visual-Semantic Matching by Exploring High-Order Attention and Distraction
TLDR
Comprehensive experiments and ablation studies on two large public datasets demonstrate the superiority of the proposed method and the effectiveness of both high-order attention and distraction.
Multi-Modal Memory Enhancement Attention Network for Image-Text Matching
TLDR
This paper proposes to achieve fine-grained visual-textual alignment from two aspects: exploiting an attention mechanism to locate the semantically meaningful portions and leveraging a memory network to capture long-term contextual knowledge.

References

Showing 1-10 of 44 references
Learning Semantic Concepts and Order for Image and Sentence Matching
TLDR
A semantic-enhanced image and sentence matching model is proposed, which can improve the image representation by learning semantic concepts and then organizing them in a correct semantic order.
Stacked Cross Attention for Image-Text Matching
TLDR
Stacked Cross Attention is proposed to discover the full latent alignments, using both image regions and words in a sentence as context to infer image-text similarity, and achieves state-of-the-art results on the MS-COCO and Flickr30K datasets.
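As a rough, hypothetical sketch of the cross-attention idea summarized in the entry above (not the released SCAN code), the snippet below lets each word attend over image regions and scores the image-text pair by the average word-to-attended-region cosine similarity; the temperature value and feature sizes are illustrative assumptions.

import torch
import torch.nn.functional as F

def cross_attention_similarity(regions, words, temperature=9.0):
    """regions: (n_regions, dim); words: (n_words, dim); shared embedding dim."""
    r = F.normalize(regions, dim=-1)
    w = F.normalize(words, dim=-1)
    # Each word attends over all image regions.
    attn = F.softmax(temperature * w @ r.t(), dim=-1)     # (n_words, n_regions)
    attended = attn @ regions                              # one region context per word
    # Pool word-level similarities into a single image-text score.
    return F.cosine_similarity(words, attended, dim=-1).mean()

# Example with random features: 36 regions and 12 words, both 1024-dimensional.
print(cross_attention_similarity(torch.randn(36, 1024), torch.randn(12, 1024)))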
Exploring Visual Relationship for Image Captioning
TLDR
This paper introduces a new design that explores the connections between objects for image captioning under the umbrella of the attention-based encoder-decoder framework, integrating both semantic and spatial object relationships into the image encoder.
Deep visual-semantic alignments for generating image descriptions
TLDR
A model that generates natural language descriptions of images and their regions is presented, based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding
TLDR
This work presents a hierarchically structured recurrent neural network (RNN), namely the Hierarchical Multimodal LSTM (HM-LSTM), which exploits the hierarchical relations between sentences and phrases, and between whole images and image regions, to jointly establish their representations.
Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM
  • Yan Huang, Wei Wang, Liang Wang
  • Computer Science, Mathematics
  • 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
TLDR
Extensive experiments show that the proposed selective multimodal Long Short-Term Memory network can effectively match images and sentences with complex content, achieving state-of-the-art results on two public benchmark datasets.
Leveraging Visual Question Answering for Image-Caption Ranking
TLDR
This work views VQA as a “feature extraction” module to extract image and caption representations and finds that incorporating and reasoning about consistency between images and captions significantly improves performance.
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
TLDR
This work proposes to incorporate generative processes into the cross-modal feature embedding, through which it is able to learn not only the global abstract features but also the local grounded features of image-text pairs.
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
TLDR
A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.