Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues

@article{plummer2017phrase,
  title={Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues},
  author={Bryan A. Plummer and Arun Mallya and Christopher M. Cervantes and J. Hockenmaier and Svetlana Lazebnik},
  journal={2017 IEEE International Conference on Computer Vision (ICCV)},
  year={2017}
}
This paper presents a framework for localization, or grounding, of phrases in images using a large collection of linguistic and visual cues. The framework models the appearance, size, and position of entity bounding boxes; adjectives that carry attribute information; and spatial relationships between pairs of entities connected by verbs or prepositions. Special attention is given to relationships between people and mentions of clothing or body parts, as these are useful for distinguishing individuals. We…


Propagating Over Phrase Relations for One-Stage Visual Grounding
A linguistic-structure-guided propagation network for one-stage phrase grounding that explicitly exploits the linguistic structure of the sentence and performs relational propagation among noun phrases under the guidance of the linguistic relations between them.
PIRC Net : Using Proposal Indexing, Relationships and Context for Phrase Grounding
This paper presents a framework that leverages information such as phrase category, relationships among neighboring phrases in a sentence and context to improve the performance of phrase grounding systems and proposes three modules: Proposal Indexing Network (PIN), Inter-phrase Regression Network (IRN) and Proposal Ranking Network (PRN).
Natural Language Guided Visual Relationship Detection
This work proposes a generic bi-directional recurrent neural network that predicts, from the perspective of natural language, the semantic connection between the participating objects in a relationship, achieving state-of-the-art results on the Visual Relationship Detection and Visual Genome datasets.
Visual Relationship Detection with Language prior and Softmax
  • Jaewon Jung, Jongyoul Park
  • 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS), 2018
The models in this work outperform the state of the art without costly linguistic knowledge distillation from a large text corpus or the construction of complex loss functions.
Grounding natural language phrases in images and video
This dissertation introduces a new dataset providing ground-truth annotations for the locations of noun phrase chunks in image captions, and presents an approach that learns a set of models, each capturing a different concept useful for the task.
Semi Supervised Phrase Localization in a Bidirectional Caption-Image Retrieval Framework
A novel deep neural network architecture that links visual regions to corresponding textual segments, including phrases and words, and outperforms state-of-the-art results in the semi-supervised phrase localization setting.
Phrase Localization Without Paired Training Examples
This work postulates that paired annotations are unnecessary and proposes the first method for phrase localization that requires neither a training procedure nor paired, task-specific data, making it applicable to any domain, including those where no paired phrase localization annotations are available.
Open-vocabulary Phrase Detection
This paper addresses a more realistic version of the natural language grounding task, in which the model must both identify whether the phrase is relevant to an image and localize it, and proposes a Phrase R-CNN network that extends Faster R-CNN to relate image regions and phrases.
A better loss for visual-textual grounding
This model, although using a simple multi-modal feature fusion component, achieves higher accuracy than state-of-the-art models on two widely adopted datasets, reaching a better learning balance between the two sub-tasks of the grounding problem.
Revisiting Image-Language Networks for Open-Ended Phrase Detection
This paper addresses a more realistic version of the natural language grounding task, in which the model must both identify whether the phrase is relevant to an image and localize it, and proposes an approach that extends Faster R-CNN to relate image regions and phrases.

References

Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
This paper presents Flickr30K Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes.
From captions to visual concepts and back
This paper uses multiple instance learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives, and develops a maximum-entropy language model.
Visual Relationship Detection with Language Priors
This work proposes a model that can scale to predict thousands of types of relationships from a few examples and improves on prior work by leveraging language priors from semantic word embeddings to fine-tune the likelihood of a predicted relationship.
MSRC: multimodal spatial regression with semantic context for phrase grounding
A novel multimodal spatial regression with semantic context (MSRC) system that not only predicts the location of the ground-truth box from proposal bounding boxes, but also refines its predictions by penalizing similarity among different queries drawn from the same sentence.
Structured Matching for Phrase Localization
A structured matching of phrases and regions that encourages the semantic relations between phrases to agree with the visual relations between regions; the matching is formulated as a discrete optimization problem and relaxed to a linear program.
A Sentence Is Worth a Thousand Pixels
A holistic conditional random field model for semantic parsing is proposed that reasons jointly about which objects are present in the scene, their spatial extent, and semantic segmentation, employing text as well as image information as input.
Recognition using visual phrases
It is shown that a visual phrase detector significantly outperforms a baseline which detects component objects and reasons about relations, even though visual phrase training sets tend to be smaller than those for objects.
VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases
This work introduces the problem of visual verification of relation phrases and develops a Visual Knowledge Extraction system called VisKE, which has been used not only to enrich existing textual knowledge bases by improving their recall, but also to augment open-domain question answering.
What Are You Talking About? Text-to-Image Coreference
This paper proposes a structure prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects.
Natural Language Object Retrieval
Experimental results demonstrate that the SCRC model effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large scale vision and language datasets for knowledge transfer.