Learning Two-Branch Neural Networks for Image-Text Matching Tasks

@article{Wang2019LearningTN,
  title={Learning Two-Branch Neural Networks for Image-Text Matching Tasks},
  author={Liwei Wang and Yin Li and Jing Huang and Svetlana Lazebnik},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2019},
  volume={41},
  pages={394-407}
}
  • Liwei Wang, Yin Li, Jing Huang, S. Lazebnik
  • Published 11 April 2017
  • Computer Science
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
Image-language matching tasks have recently attracted a lot of attention in the computer vision field. These tasks include image-sentence matching, i.e., given an image query, retrieving relevant sentences and vice versa, and region-phrase matching or visual grounding, i.e., matching a phrase to relevant regions. This paper investigates two-branch neural networks for learning the similarity between these two data modalities. We propose two network structures that produce different output… 
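
The paper proposes an embedding network and a similarity network; the sketch below illustrates only the embedding-network idea, with two projection branches and a bidirectional margin-based ranking loss over in-batch negatives. It is a minimal PyTorch sketch under assumed feature dimensions, module names, and hyperparameters, not the authors' released code.

```python
# Minimal sketch (assumed dimensions and names, not the authors' code):
# two branches project pre-extracted image and text features into a shared
# space; a bidirectional hinge/triplet loss pulls matching pairs together.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedding(nn.Module):
    def __init__(self, img_dim=4096, txt_dim=300, embed_dim=512):
        super().__init__()
        self.img_branch = nn.Sequential(nn.Linear(img_dim, 2048), nn.ReLU(),
                                        nn.Linear(2048, embed_dim))
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, 2048), nn.ReLU(),
                                        nn.Linear(2048, embed_dim))

    def forward(self, img_feat, txt_feat):
        # L2-normalize so the inner product below is a cosine similarity.
        x = F.normalize(self.img_branch(img_feat), dim=-1)
        y = F.normalize(self.txt_branch(txt_feat), dim=-1)
        return x, y

def bidirectional_ranking_loss(x, y, margin=0.2):
    """Hinge loss over in-batch negatives in both retrieval directions."""
    sim = x @ y.t()                       # (B, B) image-sentence similarities
    pos = sim.diag().unsqueeze(1)         # similarity of the matching pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2s = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)
    cost_s2i = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)
    return cost_i2s.mean() + cost_s2i.mean()
```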

Citations

A neural architecture to learn image-text joint embedding
TLDR
This project builds two-branch neural networks for learning image-text similarity and trains and validates them on the Flickr30K and MSCOCO datasets for the image-sentence retrieval task, i.e., given an input image, the goal is to find the best matching sentences from a database.
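For context, image-sentence retrieval on Flickr30K and MSCOCO is conventionally reported as Recall@K. The sketch below is a generic evaluation helper (assuming one ground-truth sentence per image for simplicity), not that project's code.

```python
# Generic Recall@K computation for image-to-sentence retrieval (illustrative;
# benchmarks such as Flickr30K pair each image with five sentences, which
# this simplified sketch collapses to one ground-truth sentence per image).
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """sim[i, j] = similarity of image i and sentence j; matches lie on the diagonal."""
    ranks = []
    for i, row in enumerate(sim):
        order = np.argsort(-row)                  # sentences sorted by decreasing similarity
        ranks.append(int(np.where(order == i)[0][0]))
    ranks = np.asarray(ranks)
    return {f"R@{k}": float(np.mean(ranks < k)) for k in ks}
```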
A unified cycle-consistent neural model for text and image retrieval
TLDR
This proposal leverages an end-to-end trainable architecture that can translate text into image features and vice versa, regularizes this mapping with a cycle-consistency criterion, and confirms the appropriateness of using a cycle-consistency constraint for the text-image matching task.
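A minimal sketch of the kind of cycle-consistency regularizer described above (shapes, names, and the squared-error penalty are assumptions, not the paper's implementation): text features are mapped into the image-feature space and back, and the round trip is penalized for drifting from the original.

```python
# Illustrative cycle-consistency penalty (assumed 512-d features on both sides).
import torch
import torch.nn as nn

txt_to_img = nn.Linear(512, 512)   # text embedding -> image feature space
img_to_txt = nn.Linear(512, 512)   # image feature  -> text embedding space

def cycle_consistency_loss(txt_emb):
    reconstructed = img_to_txt(txt_to_img(txt_emb))
    return torch.mean((reconstructed - txt_emb) ** 2)
```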
Dual-Path Convolutional Image-Text Embedding
TLDR
This paper builds a convolutional network amenable to fine-tuning the visual and textual representations, where the entire network contains only four components, i.e., the convolution layer, pooling layer, rectified linear unit function (ReLU), and batch normalisation.
Dual-path Convolutional Image-Text Embeddings with Instance Loss
TLDR
An end-to-end dual-path convolutional network to learn the image and text representations based on an unsupervised assumption that each image/text group can be viewed as a class, which allows the system to directly learn from the data and fully utilize the supervision.
Learning Image-Text Embeddings with Instance Loss
TLDR
An end-to-end dual-path convolutional network to learn the image and text representations based on an unsupervised assumption that each image/text group can be viewed as a class, so the network can learn fine-grained distinctions from every image/text group.
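A minimal sketch of an instance loss in the spirit of the two entries above (the class count and feature size are placeholders; the published dual-path model combines this with a ranking objective): every training image/text group becomes its own class, and one shared softmax classifier supervises both branches.

```python
# Illustrative instance loss: one class per training image/text group,
# classified from both modalities with a shared linear classifier.
import torch.nn as nn

num_instances = 30000                        # placeholder: number of training groups
classifier = nn.Linear(512, num_instances)   # shared across image and text branches
ce = nn.CrossEntropyLoss()

def instance_loss(img_emb, txt_emb, instance_ids):
    return ce(classifier(img_emb), instance_ids) + ce(classifier(txt_emb), instance_ids)
```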
Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval
TLDR
This work proposes a two-stage approach for the task that can augment a typical supervised pair-wise ranking loss based formulation with weakly-annotated web images to learn a more robust visual-semantic embedding.
A Neighbor-aware Approach for Image-text Matching
  • Chunxiao Liu, Zhendong Mao, W. Zang, Bin Wang
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
A neighbor-aware network for image-text matching is proposed, in which an intra-attention module and a neighbor-aware ranking loss jointly distinguish data with different semantics; more importantly, semantically unrelated data within a neighborhood can be distinguished.
Cross-Modal Attention With Semantic Consistence for Image–Text Matching
TLDR
The proposed CASC is a joint framework that performs cross-modal attention for local alignment and multilabel prediction for global semantic consistency; it directly extracts semantic labels from the available sentence corpus without additional labeling cost, providing a global similarity constraint on the aggregated region-word similarity obtained from the local alignment.
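The local alignment step can be pictured with a generic region-word attention sketch like the one below (stacked-attention style; the shapes, temperature, and mean pooling are assumptions, not the CASC implementation).

```python
# Generic cross-modal attention for region-word alignment (illustrative):
# each word attends over image regions, and the attended similarities are
# pooled into a single image-sentence score.
import torch
import torch.nn.functional as F

def local_alignment_score(regions, words, temperature=9.0):
    """regions: (R, D) and words: (W, D), both assumed L2-normalized."""
    sim = words @ regions.t()                    # (W, R) word-region cosine similarities
    attn = F.softmax(temperature * sim, dim=1)   # each word attends over regions
    attended = attn @ regions                    # (W, D) attended region context per word
    word_scores = F.cosine_similarity(words, attended, dim=1)
    return word_scores.mean()                    # sentence-level matching score
```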
Position Focused Attention Network for Image-Text Matching
TLDR
This paper proposes a novel position focused attention network (PFAN) to investigate the relation between the visual and the textual views, and integrates the object position cue to enhance visual-text joint-embedding learning.
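One simple way to picture a position cue is to append normalized box coordinates to each region feature, as in the sketch below; this is only an illustrative stand-in, since PFAN's actual mechanism learns block-based position embeddings with attention.

```python
# Illustrative position cue: concatenate normalized (x1, y1, x2, y2) box
# coordinates onto each region's appearance feature (not PFAN's block-based
# position attention).
import torch

def add_position_cue(region_feats, boxes, image_wh):
    """region_feats: (R, D); boxes: (R, 4) as (x1, y1, x2, y2); image_wh: (W, H)."""
    W, H = image_wh
    norm_boxes = boxes / torch.tensor([W, H, W, H], dtype=boxes.dtype)
    return torch.cat([region_feats, norm_boxes], dim=1)   # (R, D + 4)
```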
Context-Aware Attention Network for Image-Text Retrieval
TLDR
A unified Context-Aware Attention Network (CAAN) is proposed, which selectively focuses on critical local fragments (regions and words) by aggregating the global context and simultaneously utilizes global inter-modal alignments and intra-modal correlations to discover latent semantic relations.

References

SHOWING 1-10 OF 72 REFERENCES
Learning Deep Structure-Preserving Image-Text Embeddings
This paper proposes a method for learning joint embeddings of images and text using a two-branch neural network with multiple layers of linear projections followed by nonlinearities. The network is trained using a large-margin objective that combines cross-view ranking constraints with within-view neighborhood structure preservation constraints.
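To complement the cross-view ranking sketch given after the abstract, here is a minimal sketch of a within-view neighborhood-preservation term in the spirit of this reference (illustrative notation, not the paper's exact formulation): embeddings of sentences describing the same image act as positives for each other, with a sentence from a different image as the negative.

```python
# Illustrative within-view structure-preservation term: keep neighbors
# (e.g., sentences of the same image) closer than non-neighbors by a margin.
import torch

def neighborhood_loss(anchor, positive, negative, margin=0.2):
    """All inputs are L2-normalized embeddings of the same modality, shape (B, D)."""
    d_pos = 1.0 - torch.sum(anchor * positive, dim=1)   # cosine distance to a neighbor
    d_neg = 1.0 - torch.sum(anchor * negative, dim=1)   # cosine distance to a non-neighbor
    return torch.clamp(margin + d_pos - d_neg, min=0).mean()
```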
Linking Image and Text with 2-Way Nets
TLDR
A novel, bi-directional neural network architecture for the task of matching vectors from two data sources, enabling the use of Euclidean loss for correlation maximization and showing state-of-the-art results on a number of computer vision matching tasks.
Predicting Deep Zero-Shot Convolutional Neural Networks Using Textual Descriptions
TLDR
A new model is presented that can classify unseen categories from their textual description; it takes advantage of the architecture of CNNs and learns features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches.
Multimodal Convolutional Neural Networks for Matching Image and Sentence
TLDR
The m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities, significantly outperforming the state-of-the-art approaches for bidirectional image and sentence retrieval on the Flickr8K and Flickr30K datasets.
Grounding of Textual Phrases in Images by Reconstruction
TLDR
A novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly, and demonstrates its effectiveness on the Flickr30k Entities and ReferItGame datasets.
Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
TLDR
The m-RNN model directly models the probability distribution of generating a word given previous words and an image, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
DenseCap: Fully Convolutional Localization Networks for Dense Captioning
TLDR
A Fully Convolutional Localization Network (FCLN) architecture is proposed that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization.
Show and tell: A neural image caption generator
TLDR
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract)
TLDR
This work proposes to frame sentence-based image annotation as the task of ranking a given pool of captions, and introduces a new benchmark collection, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.
Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
TLDR
This work introduces the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder, and shows that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic.