Identity-Aware Textual-Visual Matching with Latent Co-attention

Shuang Li, Tong Xiao, Hongsheng Li, Wei Yang, Xiaogang Wang
2017 IEEE International Conference on Computer Vision (ICCV)
Textual-visual matching aims at measuring similarities between sentence descriptions and images. Most existing methods tackle this problem without effectively utilizing identity-level annotations. In this paper, we propose an identity-aware two-stage framework for the textual-visual matching problem. Our stage-1 CNN-LSTM network learns to embed cross-modal features with a novel Cross-Modal Cross-Entropy (CMCE) loss. The stage-1 network is able to efficiently screen easy incorrect matchings and… 

Learning Aligned Image-Text Representations Using Graph Attentive Relational Network
A graph attentive relational network (GARN) is proposed to learn the aligned image-text representations by modeling the relationships between noun phrases in a text for identity-aware image-text matching.
Pose-Guided Multi-Granularity Attention Network for Text-Based Person Search
A pose-guided multi-granularity attention network (PMA) is proposed, which employs pose information to learn latent semantic alignment between visual body parts and textual noun phrases. Experimental results show that this approach outperforms the state-of-the-art methods by 15% in terms of the top-1 metric.
Cross-Modal Attention With Semantic Consistence for Image–Text Matching
The proposed CASC is a joint framework that performs cross-modal attention for local alignment and multilabel prediction for global semantic consistence. It extracts semantic labels directly from the available sentence corpus without additional labor cost, which provides a global similarity constraint for the aggregated region-word similarity obtained by the local alignment.
Semantically Self-Aligned Network for Text-to-Image Part-aware Person Re-identification
This paper proposes a novel method that automatically extracts semantically aligned part-level features from the two modalities, and introduces a Compound Ranking (CR) loss that makes use of textual descriptions for other images of the same identity to provide extra supervision, thereby effectively reducing the intra-class variance in textual features.
Context-Aware Attention Network for Image-Text Retrieval
A unified Context-Aware Attention Network (CAAN) is proposed, which selectively focuses on critical local fragments (regions and words) by aggregating the global context and simultaneously utilizes global inter-modal alignments and intra-modal correlations to discover latent semantic relations.
Cascade Attention Network for Person Search: Both Image and Text-Image Similarity Selection
A cascade attention network (CAN) is proposed that progressively selects from person-image and text-image similarities, picking the description-related similarity scores out of the local similarities.
AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID
This work presents AXM-Net, a novel CNN based architecture designed for learning semantically aligned visual and textual representations that outperforms the current state-of-the-art methods by a significant margin.
Pose-Guided Joint Global and Attentive Local Matching Network for Text-Based Person Search
Experimental results show that the proposed pose-guided joint global and attentive local matching network (GALM) outperforms the state-of-the-art methods by 15% in terms of the top-1 metric.
AXM-Net: Cross-Modal Alignment and Contextual Attention for Person Re-ID
A novel convolutional neural network (CNN) based architecture designed to learn semantically Aligned cross-Modal (AXM-Net) visual and textual representations that outperforms the current state-of-the-art method on two tasks, person search and cross-modal Re-ID.
Dual-path CNN with Max Gated block for Text-Based Person Re-identification


ViP-CNN: Visual Phrase Guided Convolutional Neural Network
In ViP-CNN, a Phrase-guided Message Passing Structure (PMPS) is presented to establish the connection among relationship components and help the model consider the three problems jointly. Experimental results show that ViP-CNN outperforms the state-of-the-art method in both speed and accuracy.
Learning Deep Representations of Fine-Grained Visual Descriptions
This model achieves strong performance on zero-shot text-based image retrieval and significantly outperforms the attribute-based state-of-the-art for zero-shot classification on the Caltech-UCSD Birds 200-2011 dataset.
Learning Deep Structure-Preserving Image-Text Embeddings
This paper proposes a method for learning joint embeddings of images and text using a two-branch neural network with multiple layers of linear projections followed by nonlinearities. The network is
DeViSE: A Deep Visual-Semantic Embedding Model
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
Deep Visual-Semantic Alignments for Generating Image Descriptions
  • A. Karpathy, Li Fei-Fei
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2017
A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.
Visual7W: Grounded Question Answering in Images
A semantic link between textual descriptions and image regions by object-level grounding enables a new type of QA with visual answers, in addition to textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.
Hierarchical Question-Image Co-Attention for Visual Question Answering
This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs).
Dual Attention Networks for Multimodal Reasoning and Matching
This work proposes Dual Attention Networks which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language and introduces two types of DANs for multimodal reasoning and matching, respectively.
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.
Joint Detection and Identification Feature Learning for Person Search
A new deep learning framework for person search that jointly handles pedestrian detection and person re-identification in a single convolutional neural network, trained with an Online Instance Matching (OIM) loss that converges much faster and better than the conventional Softmax loss.