Query-Guided Regression Network with Context Policy for Phrase Grounding

@inproceedings{Chen2017QueryGuidedRN,
  title={Query-Guided Regression Network with Context Policy for Phrase Grounding},
  author={Kan Chen and Rama Kovvuri and Ramakant Nevatia},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017},
  pages={824--832}
}
Given a textual description of an image, phrase grounding localizes the objects in the image referred to by query phrases in the description. State-of-the-art methods address the problem by ranking a set of proposals based on their relevance to each query; these methods are limited by the performance of independent proposal generation systems and ignore useful cues from context in the description. In this paper, we adopt a spatial regression method to break the performance limit, and introduce reinforcement…
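The abstract's "spatial regression" refines a proposal box toward the ground-truth box rather than merely ranking proposals. A minimal sketch of the standard R-CNN-style box parameterization commonly used for this (the paper's exact formulation is not shown here and may differ; `encode_box`/`decode_box` are illustrative names):

```python
import math

def encode_box(proposal, ground_truth):
    """Regression targets mapping a proposal box to a ground-truth box.

    Boxes are (x_center, y_center, width, height) tuples.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    tx = (gx - px) / pw      # horizontal shift, scaled by proposal width
    ty = (gy - py) / ph      # vertical shift, scaled by proposal height
    tw = math.log(gw / pw)   # log-space width rescaling
    th = math.log(gh / ph)   # log-space height rescaling
    return (tx, ty, tw, th)

def decode_box(proposal, deltas):
    """Apply predicted regression deltas to a proposal box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))
```

Decoding the encoded targets recovers the ground-truth box exactly; at inference a network predicts the deltas from visual and query features.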

Citations

MSRC: multimodal spatial regression with semantic context for phrase grounding
TLDR
A novel multimodal spatial regression with semantic context (MSRC) system which not only predicts the location of the ground truth based on proposal bounding boxes, but also refines prediction results by penalizing similarities of different queries coming from the same sentence.
PIRC Net : Using Proposal Indexing, Relationships and Context for Phrase Grounding
TLDR
This paper presents a framework that leverages information such as phrase category, relationships among neighboring phrases in a sentence and context to improve the performance of phrase grounding systems and proposes three modules: Proposal Indexing Network (PIN), Inter-phrase Regression Network (IRN) and Proposal Ranking Network (PRN).
MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level
TLDR
This work proposes to utilize spatial attention networks for image-level visual-textual fusion preserving local (word) and global (phrase) information to refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query.
Knowledge Aided Consistency for Weakly Supervised Phrase Grounding
TLDR
A novel Knowledge Aided Consistency Network (KAC Net) is proposed which is optimized by reconstructing input query and proposal's information, and introduced a Knowledge Based Pooling (KBP) gate to focus on query-related proposals.
Relation-aware Instance Refinement for Weakly Supervised Visual Grounding
TLDR
This paper proposes a novel context-aware weakly-supervised learning method that incorporates coarse-to-fine object refinement and entity relation modeling into a two-stage deep network, capable of producing more accurate object representation and matching.
Adaptive Reconstruction Network for Weakly Supervised Referring Expression Grounding
TLDR
A novel end-to-end adaptive reconstruction network (ARN) that builds the correspondence between image region proposal and query in an adaptive manner: adaptive grounding and collaborative reconstruction.
Zero-Shot Grounding of Objects From Natural Language Queries
TLDR
A new single-stage model called ZSGNet is proposed which combines the detector network and the grounding system and predicts classification scores and regression parameters and achieves state-of-the-art performance on Flickr30k and ReferIt under the usual “seen” settings and performs significantly better than baseline in the zero-shot setting.
Utilizing Every Image Object for Semi-supervised Phrase Grounding
TLDR
This paper proposes to use learned location and subject embedding predictors (LSEP) to generate the corresponding language embeddings for objects lacking annotated queries in the training set and demonstrates that the predictors allow the grounding system to learn from the objects without labeled queries and improve accuracy.
Knowledge-guided Pairwise Reconstruction Network for Weakly Supervised Referring Expression Grounding
TLDR
A knowledge-guided pairwise reconstruction network (KPRN) is proposed, which models the relationship between the target entity (subject) and contextual entity (object) as well as grounds these two entities.

References

Showing 1-10 of 43 references
Grounding of Textual Phrases in Images by Reconstruction
TLDR
A novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly, and demonstrates its effectiveness on the Flickr30k Entities and ReferItGame datasets.
AMC: Attention Guided Multi-modal Correlation Learning for Image Search
TLDR
A novel Attention guided Multi-modal Correlation (AMC) learning method which consists of a jointly learned hierarchy of intra- and inter-attention networks to leverage visual and textual modalities for image search by learning their correlation with the input query.
Natural Language Object Retrieval
TLDR
Experimental results demonstrate that the SCRC model effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large scale vision and language datasets for knowledge transfer.
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
TLDR
This paper presents Flickr30K Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes.
Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection
TLDR
A deep Variation-structured Reinforcement Learning (VRL) framework is proposed to sequentially discover object relationships and attributes in the whole image, and an ambiguity-aware object mining scheme is used to resolve semantic ambiguity among object categories that the object detector fails to distinguish.
From captions to visual concepts and back
TLDR
This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
Deep Visual-Semantic Alignments for Generating Image Descriptions
  • A. Karpathy, Li Fei-Fei
  • Computer Science
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 2017
TLDR
A model that generates natural language descriptions of images and their regions based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding is presented.
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
TLDR
This work extensively evaluates Multimodal Compact Bilinear pooling (MCB) on the visual question answering and grounding tasks and consistently shows the benefit of MCB over ablations without MCB.
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
TLDR
This work introduces a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data and introduces a structured max-margin objective that allows this model to explicitly associate fragments across modalities.
Modeling Context Between Objects for Referring Expression Understanding
TLDR
A technique that integrates context between objects to understand referring expressions is proposed, which uses an LSTM to learn the probability of a referring expression, with input features from a region and a context region.