Situation Recognition: Visual Semantic Role Labeling for Image Understanding

@inproceedings{Yatskar2016SituationRV,
  title={Situation Recognition: Visual Semantic Role Labeling for Image Understanding},
  author={Mark Yatskar and Luke Zettlemoyer and Ali Farhadi},
  booktitle={2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2016},
  pages={5534-5542}
}
This paper introduces situation recognition, the problem of producing a concise summary of the situation an image depicts, including: (1) the main activity (e.g., clipping); (2) the participating actors, objects, substances, and locations (e.g., man, shears, sheep, wool, and field); and, most importantly, (3) the roles these participants play in the activity (e.g., the man is clipping, the shears are his tool, the wool is being clipped from the sheep, and the clipping is in a field). We use…
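To make the frame representation concrete, here is a minimal Python sketch (not from the paper) of a situation as a verb paired with a role-to-value mapping; the role names are illustrative and may not match the imSitu role set exactly.

# Hypothetical sketch of a situation frame: the main activity (verb) plus
# a mapping from semantic roles to the participants that fill them.
from dataclasses import dataclass

@dataclass
class Situation:
    verb: str              # main activity, e.g. "clipping"
    roles: dict[str, str]  # semantic role -> participant noun

clipping = Situation(
    verb="clipping",
    roles={
        "agent": "man",     # who is doing the clipping
        "tool": "shears",   # the instrument used
        "item": "wool",     # what is being clipped
        "source": "sheep",  # what the wool is clipped from
        "place": "field",   # where the activity happens
    },
)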
Citations

Grounded Situation Recognition
TLDR: A Joint Situation Localizer is proposed; jointly predicting situations and groundings with end-to-end training handily outperforms independent training on the entire grounding metric suite, with relative gains between 8% and 32%.
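A hypothetical extension of the situation sketch above shows what a grounded frame might look like: each role's filler additionally carries a bounding box, or None when the participant is not visible. The field names here are assumptions for illustration, not the paper's API.

# Hypothetical grounded frame: each role's filler is localized with an
# axis-aligned bounding box, as in joint situation-and-grounding prediction.
from dataclasses import dataclass
from typing import Optional

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class GroundedRole:
    value: str                 # participant noun, e.g. "sheep"
    box: Optional[Box] = None  # None when the participant is not visible

@dataclass
class GroundedSituation:
    verb: str                       # main activity
    roles: dict[str, GroundedRole]  # role -> grounded participant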
Automatic generation of composite image descriptions
  • Chang Liu, A. Shmilovici, Mark Last
  • 2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD)
TLDR: This paper proposes a frame-based algorithm for generating a composite natural language description for a given image; the generated descriptions contain on average 16% more visual elements than the baseline method and receive significantly higher accuracy scores from human evaluators.
Mixture-Kernel Graph Attention Network for Situation Recognition
  • M. Suhail, L. Sigal
  • 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
TLDR: This paper proposes a novel mixture-kernel attention graph neural network (GNN) architecture that enables a dynamic graph structure during training and inference through the use of a graph attention mechanism and context-aware interactions between role pairs.
Situation Recognition with Graph Neural Networks
TLDR: A model based on graph neural networks is proposed that efficiently captures joint dependencies between roles using neural networks defined on a graph, and significantly outperforms existing work as well as multiple baselines.
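As a rough illustration of the idea (the paper itself uses a gated graph neural network; this sketch is a simplification under my own assumptions), one propagation step over a fully connected graph of role nodes can be written as:

# Minimal message-passing sketch over a fully connected role graph.
# Each role node aggregates messages from all others, then applies a
# gated (GRU-style) update, loosely in the spirit of a GGNN.
import torch
import torch.nn as nn

class RoleGraphLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)   # transform outgoing node states
        self.upd = nn.GRUCell(dim, dim)  # gated state update per node

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (num_roles, dim) hidden state for each role node.
        m = self.msg(h)
        # Each node sums messages from every *other* node.
        incoming = m.sum(dim=0, keepdim=True) - m
        return self.upd(incoming, h)

# Usage: refine 6 role states of width 128 over 3 propagation steps.
layer = RoleGraphLayer(128)
h = torch.randn(6, 128)
for _ in range(3):
    h = layer(h)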
Grounding Semantic Roles in Images
TLDR: This work renders candidate participants as image regions of objects and trains a model that learns to ground roles in the regions depicting the corresponding participant, inducing frame-semantic visual representations.
Graph neural network for situation recognition
TLDR: This work proposes a novel mixture-kernel attention graph neural network architecture that enables a dynamic graph structure during training and inference, through the use of a graph attention mechanism and context-aware interactions between role pairs, and alleviates semantic sparsity by representing graph kernels as a convex combination of learned bases.
Semantic Image Retrieval via Active Grounding of Visual Situations
We describe a novel architecture for semantic image retrieval, in particular retrieval of instances of visual situations. Visual situations are concepts such as "a boxing match," "walking the dog," …
Relational graph neural network for situation recognition
TLDR: Experimental results show that the proposed RGNN outperforms state-of-the-art methods on the verb and value metrics and better captures the relationships between the activity and the objects.
Active Grounding of Visual Situations
We address a key problem for computer vision: retrieving images that are instances of visual situations. Visual situations are concepts such as "a boxing match", "a birthday party", "walking the dog", …
Interpreting Context of Images Using Scene Graphs
TLDR: This project delivers a model that finds the context present in an image by representing the image as a graph, where the nodes are the objects and the edges are the relations between them.
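A small sketch of that representation, using networkx (the library choice and the relation labels are mine, not the project's):

# Scene graph: objects as nodes, pairwise relations as labeled edges.
import networkx as nx

G = nx.DiGraph()
G.add_edge("man", "shears", relation="holding")
G.add_edge("man", "sheep", relation="clipping")
G.add_edge("wool", "sheep", relation="part_of")
G.add_edge("sheep", "field", relation="in")

# Recover the image's context as (subject, relation, object) triples.
for subj, obj, data in G.edges(data=True):
    print(subj, data["relation"], obj)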

References

Showing 1-10 of 58 references
Visual Semantic Role Labeling
TLDR: The problem of Visual Semantic Role Labeling is introduced: given an image, the goal is to detect people performing actions, localize the objects of interaction, and associate objects in the scene with the different semantic roles of each action.
Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics (Extended Abstract)
TLDR: This work proposes to frame sentence-based image annotation as the task of ranking a given pool of captions and introduces a new benchmark collection of 8,000 images, each paired with five different captions that provide clear descriptions of the salient entities and events.
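To illustrate the ranking formulation (a generic sketch, not the paper's model), assume image and caption embeddings live in a shared space; ranking then reduces to sorting the pool by similarity:

# Rank a pool of candidate captions against one image by cosine similarity.
import numpy as np

def rank_captions(image_vec: np.ndarray, caption_vecs: np.ndarray) -> np.ndarray:
    img = image_vec / np.linalg.norm(image_vec)
    caps = caption_vecs / np.linalg.norm(caption_vecs, axis=1, keepdims=True)
    scores = caps @ img          # cosine similarity of each caption to the image
    return np.argsort(-scores)   # caption indices, best match first

# Toy usage with random 64-d embeddings for one image and five captions.
rng = np.random.default_rng(0)
print(rank_captions(rng.normal(size=64), rng.normal(size=(5, 64))))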
YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition
TLDR: This paper presents a solution that takes a short video clip and outputs a brief sentence summing up the main activity in the video, including the actor, the action, and its object, and uses a Web-scale language model to "fill in" novel verbs.
Visual Madlibs: Fill in the blank Image Generation and Question Answering
TLDR: A new dataset of 360,001 focused natural language descriptions for 10,738 images is introduced, and its applicability to two new tasks is demonstrated: focused description generation and multiple-choice question answering for images.
Microsoft COCO: Common Objects in Context
We present a new dataset with the goal of advancing the state of the art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. …
Show and tell: A neural image caption generator
TLDR: This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
Actions in context
TLDR: This paper automatically discovers relevant scene classes and their correlation with human actions, shows how to learn selected scene classes from video without manual supervision, and develops a joint framework for action and scene recognition that demonstrates improved recognition of both in natural video.
Grouplet: A structured image representation for recognizing human and object interactions
  • B. Yao, Li Fei-Fei
  • 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
TLDR: It is shown that grouplets are more effective at classifying and detecting human-object interactions than other state-of-the-art methods, and that they can robustly distinguish humans playing instruments from humans merely co-occurring with instruments.
What Are You Talking About? Text-to-Image Coreference
TLDR: This paper proposes a structured prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects and the scene type, as well as to align the nouns/pronouns with the referred visual objects.
CIDEr: Consensus-based image description evaluation
TLDR: A novel paradigm for evaluating image descriptions based on human consensus is proposed, together with a new automated metric that captures human judgment of consensus better than existing metrics across sentences generated by various sources.
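The consensus idea can be sketched as TF-IDF-weighted n-gram similarity between a candidate caption and the references. The following is a deliberately simplified, unigram-only illustration; real CIDEr averages over n-gram orders 1-4 and computes IDF over the whole corpus.

# Simplified consensus scoring: cosine similarity of TF-IDF unigram
# vectors between a candidate caption and each human reference.
import math
from collections import Counter

def tfidf(counts, df, n_docs):
    total = sum(counts.values())
    return {w: (c / total) * math.log((n_docs + 1) / (1 + df[w]))
            for w, c in counts.items()}

def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

refs = [["a", "man", "clipping", "a", "sheep"],
        ["a", "man", "shearing", "wool", "from", "a", "sheep"]]
cand = ["a", "man", "shearing", "a", "sheep"]

df = Counter(w for r in refs for w in set(r))  # reference document frequency
ref_vecs = [tfidf(Counter(r), df, len(refs)) for r in refs]
cand_vec = tfidf(Counter(cand), df, len(refs))
print(sum(cosine(cand_vec, v) for v in ref_vecs) / len(refs))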