Saliency-Guided Attention Network for Image-Sentence Matching

@article{Ji2019SaliencyGuidedAN,
  title={Saliency-Guided Attention Network for Image-Sentence Matching},
  author={Zhong Ji and Haoran Wang and J. Han and Yanwei Pang},
  journal={2019 IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2019},
  pages={5753-5762}
}
This paper studies the task of matching image and sentence, where learning appropriate representations across the multi-modal data appears to be the main challenge. [...] Key Method: The proposed SAN mainly includes three components: a saliency detector, a Saliency-weighted Visual Attention (SVA) module, and a Saliency-guided Textual Attention (STA) module. Concretely, the saliency detector provides the visual saliency information as guidance for the two attention modules.
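As a rough illustration of this three-component structure, the sketch below wires a precomputed region-level saliency signal (standing in for the saliency detector's output) into a saliency-weighted visual attention step and a saliency-guided textual attention step. It is a minimal PyTorch sketch under those assumptions; the class names, dimensions, and exact attention formulations are illustrative, not the authors' implementation.

```python
# Illustrative sketch of a SAN-like pipeline: an external saliency detector is
# assumed to supply per-region saliency scores; SVA pools region features by
# saliency, and STA attends over words guided by the resulting visual vector.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SaliencyWeightedVisualAttention(nn.Module):
    """Re-weights region features by their saliency scores (assumed precomputed)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, regions, saliency):
        # regions: (B, R, D) region features; saliency: (B, R) scores
        weights = F.softmax(saliency, dim=-1).unsqueeze(-1)   # (B, R, 1)
        attended = (weights * regions).sum(dim=1)             # (B, D)
        return self.proj(attended)


class SaliencyGuidedTextualAttention(nn.Module):
    """Attends over word features, guided by the saliency-aware visual vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, words, visual):
        # words: (B, T, D) word features; visual: (B, D) visual context
        scores = self.score(torch.tanh(words + visual.unsqueeze(1)))  # (B, T, 1)
        weights = F.softmax(scores, dim=1)
        return (weights * words).sum(dim=1)                           # (B, D)


class SANSketch(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.sva = SaliencyWeightedVisualAttention(dim)
        self.sta = SaliencyGuidedTextualAttention(dim)

    def forward(self, regions, saliency, words):
        v = self.sva(regions, saliency)           # saliency-weighted visual embedding
        t = self.sta(words, v)                    # saliency-guided textual embedding
        return F.cosine_similarity(v, t, dim=-1)  # image-sentence matching score
```

In the paper the saliency detector operates on the image itself (pixel-level maps); here that step is abstracted into the precomputed per-region scores for brevity.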
SMAN: Stacked Multimodal Attention Network for Cross-Modal Image-Text Retrieval.
TLDR
A stacked multimodal attention network (SMAN) is proposed that makes use of the stacked multimodal attention mechanism to exploit the fine-grained interdependencies between image and text, thereby mapping the aggregation of attentive fragments into a common space for measuring cross-modal similarity.
Learning Dual Semantic Relations With Graph Attention for Image-Text Matching
TLDR
A novel multi-level semantic relations enhancement approach named Dual Semantic Relations Attention Network (DSRAN) is proposed, which mainly consists of two modules: a separate semantic relations module and a joint semantic relations module, for region-level relations enhancement and regional-global relations enhancement at the same time.
IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval
TLDR
This paper proposes an Iterative Matching with Recurrent Attention Memory method, in which correspondences between images and texts are captured with multiple steps of alignments, and introduces an iterative matching scheme to explore such fine-grained correspondence progressively.
Stacked squeeze-and-excitation recurrent residual network for visual-semantic matching
TLDR
A Stacked Squeeze-and-Excitation Recurrent Residual Network (SER2-Net) for visual-textual matching is introduced, along with a novel objective, the Cross-modal Semantic Discrepancy (CMSD) loss, which exploits the interdependency among different semantic levels to narrow the cross-modal distribution discrepancy.
Similarity Reasoning and Filtration for Image-Text Matching
TLDR
A novel Similarity Graph Reasoning and Attention Filtration network for image-text matching using vector-based similarity representations is proposed, and its superiority is demonstrated by achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets.
Context-Aware Multi-View Summarization Network for Image-Text Matching
TLDR
A novel context-aware multi-view summarization network is proposed to summarize context-enhanced visual region information from multiple views, with an adaptive gating self-attention module designed to extract representations of visual regions and words.
Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence
TLDR
This paper proposes a novel framework that maps instances into multiple individual embedding spaces so that they can capture multiple relationships between instances, leading to compelling video retrieval.
Structure-Consistent Weakly Supervised Salient Object Detection with Local Saliency Coherence
TLDR
This work proposes a one-round end-to-end training approach for weakly supervised salient object detection via scribble annotations without pre/post-processing operations or extra supervision data, and designs a saliency structure consistency loss as a self-consistency mechanism to ensure consistent saliency maps are predicted with different scales of the same image as input.
Deep Relation Embedding for Cross-Modal Retrieval
TLDR
A Cross-modal Relation Guided Network (CRGN) is proposed to embed image and text into a latent feature space, achieving better or comparable performance relative to the state-of-the-art methods with notable efficiency.
Enhancing Cross-Modal Retrieval Based on Modality-Specific and Embedding Spaces
TLDR
A new approach that drastically improves cross-modal retrieval performance in vision and language (hereinafter referred to as "vision and language retrieval") is proposed in this paper, which makes use of multiple individual representation spaces through text-to-image and image-to-text models.

References

SHOWING 1-10 OF 52 REFERENCES
Dual Attention Networks for Multimodal Reasoning and Matching
TLDR
This work proposes Dual Attention Networks which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language and introduces two types of DANs for multimodal reasoning and matching, respectively.
Dual-Path Convolutional Image-Text Embedding
TLDR
This paper builds a convolutional network amenable to fine-tuning the visual and textual representations, where the entire network contains only four components, i.e., convolution layer, pooling layer, rectified linear unit function (ReLU), and batch normalisation.
Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM
  • Yan Huang, Wei Wang, Liang Wang
  • Computer Science, Mathematics
  • 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
TLDR
Extensive experiments show that the proposed selective multimodal Long Short-Term Memory network can effectively match images and sentences with complex content, and achieves state-of-the-art results on two public benchmark datasets.
R³Net: Recurrent Residual Refinement Network for Saliency Detection
TLDR
A novel recurrent residual refinement network (R³Net) equipped with residual refinement blocks (RRBs) is proposed to more accurately detect salient regions of an input image, outperforming competitors on all the benchmark datasets.
Stacked Cross Attention for Image-Text Matching
TLDR
Stacked Cross Attention is proposed to discover the full latent alignments, using both image regions and words in a sentence as context to infer the image-text similarity, and achieves state-of-the-art results on the MS-COCO and Flickr30K datasets.
Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding
TLDR
This work presents a hierarchically structured recurrent neural network (RNN), namely Hierarchical Multimodal LSTM (HM-LSTM), which exploits the hierarchical relations between sentences and phrases, and between whole images and image regions, to jointly establish their representations.
Learning Semantic Concepts and Order for Image and Sentence Matching
TLDR
A semantic-enhanced image and sentence matching model is proposed, which can improve the image representation by learning semantic concepts and then organizing them in a correct semantic order.
Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models
TLDR
This work proposes to incorporate generative processes into the cross-modal feature embedding, through which it is able to learn not only the global abstract features but also the local grounded features of image-text pairs.
Deeply Supervised Salient Object Detection with Short Connections
TLDR
A new saliency method is proposed by introducing short connections to the skip-layer structures within the HED architecture, which produces state-of-the-art results on 5 widely tested salient object detection benchmarks, with advantages in terms of efficiency, effectiveness, and simplicity over the existing algorithms.
Multimodal Convolutional Neural Networks for Matching Image and Sentence
TLDR
The m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities, significantly outperforming the state-of-the-art approaches for bidirectional image and sentence retrieval on the Flickr8K and Flickr30K datasets.