Commonly Uncommon: Semantic Sparsity in Situation Recognition

@article{Yatskar2017CommonlyUS,
  title={Commonly Uncommon: Semantic Sparsity in Situation Recognition},
  author={Mark Yatskar and Vicente Ordonez and Luke Zettlemoyer and Ali Farhadi},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017},
  pages={6335-6344}
}
Semantic sparsity is a common challenge in structured visual classification problems; when the output space is complex, the vast majority of the possible predictions are rarely, if ever, seen in the training set. This paper studies semantic sparsity in situation recognition, the task of producing structured summaries of what is happening in images, including activities, objects, and the roles objects play within the activity. For this problem, we find empirically that most substructures required…
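The abstract's key technical move is a composition function that lets rare (verb, role, noun) substructures share training signal through their parts. A minimal sketch of what such a factored potential could look like follows; the PyTorch names, the embedding dimension, and the Hadamard-product (CP-style) factorization are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class TensorPotential(nn.Module):
    """Illustrative low-rank score for (verb, role, noun) substructures.

    Rarely seen triples borrow statistical strength from their parts
    because each verb, role, and noun has a single shared embedding.
    Sizes and names are assumptions, not the paper's exact model.
    """

    def __init__(self, n_verbs, n_roles, n_nouns, dim=128):
        super().__init__()
        self.verb_emb = nn.Embedding(n_verbs, dim)
        self.role_emb = nn.Embedding(n_roles, dim)
        self.noun_emb = nn.Embedding(n_nouns, dim)

    def forward(self, verbs, roles, nouns):
        # Elementwise product followed by a sum approximates a full
        # three-way tensor with a rank-`dim` (CP) factorization.
        v = self.verb_emb(verbs)
        r = self.role_emb(roles)
        n = self.noun_emb(nouns)
        return (v * r * n).sum(dim=-1)  # one scalar potential per triple
```

In a structured prediction model, potentials like this could be combined with a verb score and normalized over a frame's roles.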
Mixture-Kernel Graph Attention Network for Situation Recognition
  • M. Suhail, L. Sigal
  • Computer Science
  • 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
TLDR
This paper proposes a novel mixture-kernel attention graph neural network (GNN) architecture that enables a dynamic graph structure during training and inference through a graph attention mechanism and context-aware interactions between role pairs.
Graph neural network for situation recognition
TLDR
This work proposes a novel mixture-kernel attention graph neural network architecture that enables a dynamic graph structure during training and inference through a graph attention mechanism and context-aware interactions between role pairs, and alleviates semantic sparsity by representing graph kernels as a convex combination of learned bases.
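Reading the two entries above together, a rough sketch of the mechanism they describe (attention over role-pair edges, with each edge's message transform built as a convex combination of learned basis kernels) might look like the layer below. This is a guess for illustration under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureKernelAttention(nn.Module):
    """Illustrative message-passing layer: per-edge attention weights plus
    a per-edge kernel mixed from learned basis matrices (all assumed)."""

    def __init__(self, dim=64, n_basis=4):
        super().__init__()
        self.basis = nn.Parameter(torch.randn(n_basis, dim, dim) * 0.01)
        self.mix = nn.Linear(2 * dim, n_basis)  # per-edge mixture logits
        self.att = nn.Linear(2 * dim, 1)        # per-edge attention logit

    def forward(self, h):  # h: (n_nodes, dim) role-node states
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        alpha = F.softmax(self.att(pairs).squeeze(-1), dim=-1)  # (n, n)
        w = F.softmax(self.mix(pairs), dim=-1)                  # convex weights
        kernel = torch.einsum('ijk,kde->ijde', w, self.basis)   # per-edge kernel
        msg = torch.einsum('ijde,je->ijd', kernel, h)           # transform senders
        return (alpha.unsqueeze(-1) * msg).sum(dim=1)           # aggregate per node
```

Because the mixture weights come from a softmax, each edge's kernel stays a convex combination of shared bases, which is one plausible reading of how this helps with semantic sparsity.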
Grounded Situation Recognition
TLDR
A Joint Situation Localizer is proposed; jointly predicting situations and groundings with end-to-end training handily outperforms independent training across the entire grounding metric suite, with relative gains between 8% and 32%.
Relational graph neural network for situation recognition
TLDR
Experimental results show that the proposed RGNN outperforms state-of-the-art methods on the verb and value metrics and better captures the relationships between the activity and the objects.
Situation Recognition with Graph Neural Networks
TLDR
A model based on Graph Neural Networks is proposed that efficiently captures joint dependencies between roles using neural networks defined on a graph, and it significantly outperforms existing work as well as multiple baselines.
Attention-Based Context Aware Reasoning for Situation Recognition
TLDR
This work proposes the first set of methods to address inter-dependent queries in query-based visual reasoning, improving upon a state-of-the-art method that answers queries separately.
Weakly Supervised Visual Semantic Parsing
TLDR
A generalized formulation of scene graph generation (SGG), called Visual Semantic Parsing, is proposed; it disentangles entity and predicate recognition and enables sub-quadratic performance. The work also proposes the first graph-based weakly supervised learning framework, based on a novel graph alignment algorithm, which enables training without bounding-box annotations.
Automatic generation of composite image descriptions
  • Chang Liu, A. Shmilovici, Mark Last
  • Computer Science
  • 2017 13th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD)
  • 2017
TLDR
This paper proposes a frame-based algorithm for generating a composite natural-language description of a given image; the generated descriptions contain on average 16% more visual elements than those of the baseline method and receive significantly higher accuracy scores from human evaluators.
MovieGraphs: Towards Understanding Human-Centric Situations from Videos
TLDR
MovieGraphs is the first benchmark to focus on inferred properties of human-centric situations and opens up an exciting avenue towards socially intelligent AI agents.
Unsupervised and Semi-Supervised Image Classification With Weak Semantic Consistency
TLDR
A novel image classification method constrained by weak semantic consistency is proposed; it starts from the extreme of treating each image as its own class, and is evaluated in both unsupervised and semi-supervised experiments on several datasets.

References

SHOWING 1-10 OF 48 REFERENCES
DeViSE: A Deep Visual-Semantic Embedding Model
TLDR
This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data and semantic information gleaned from unannotated text, and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.
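As a concrete illustration of the visual-semantic embedding idea, here is a hedged sketch of a DeViSE-style hinge rank loss: image features are projected into a fixed word-vector space, and the true label's vector must outscore every other label by a margin. The cosine normalization, margin value, and function names are assumptions; `proj` stands in for a learned linear map such as `nn.Linear(img_dim, emb_dim)`.

```python
import torch
import torch.nn.functional as F

def hinge_rank_loss(img_feat, proj, label_vecs, targets, margin=0.1):
    """DeViSE-style objective sketch (illustrative, not the paper's code).

    img_feat:   (B, img_dim) image features
    label_vecs: (n_labels, emb_dim) fixed word embeddings
    targets:    (B,) true label indices
    """
    z = F.normalize(proj(img_feat), dim=-1)          # (B, emb_dim)
    sims = z @ F.normalize(label_vecs, dim=-1).T     # (B, n_labels)
    pos = sims.gather(1, targets.unsqueeze(1))       # score of the true label
    viol = (margin - pos + sims).clamp(min=0)        # hinge against each label
    mask = F.one_hot(targets, sims.size(1)).bool()   # drop the true-label term
    return viol.masked_fill(mask, 0.0).sum(dim=1).mean()
```

Because the label space is just a point cloud of word vectors, nearest-neighbor lookup in that space can also rank labels never seen during training, which is the zero-shot behavior the summary describes.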
YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition
TLDR
This paper presents a solution that takes a short video clip and outputs a brief sentence summing up the main activity in the video, including the actor, the action, and its object, and uses a Web-scale language model to "fill in" novel verbs.
Predicting Deep Zero-Shot Convolutional Neural Networks Using Textual Descriptions
TLDR
A new model is presented that can classify unseen categories from their textual description; it takes advantage of the architecture of CNNs and learns features at different layers, rather than just learning an embedding space for both modalities, as is common with existing approaches.
Situation Recognition: Visual Semantic Role Labeling for Image Understanding
This paper introduces situation recognition, the problem of producing a concise summary of the situation an image depicts, including: (1) the main activity (e.g., clipping), (2) the participating…
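To make the structured output concrete, a situation can be thought of as a verb plus a realized frame mapping each semantic role to a noun. The toy encoding below shows the shape of the prediction target; the role names and values are invented for illustration, not drawn from the dataset files.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Situation:
    """A main activity (verb) plus a realized frame: role -> noun.
    Example values are made up for illustration."""
    verb: str
    frame: Dict[str, str]

example = Situation(
    verb="clipping",
    frame={"agent": "man", "source": "sheep", "tool": "shears", "place": "field"},
)
```

Semantic sparsity arises because the number of possible `frame` assignments grows combinatorially, while most specific role-noun combinations appear rarely, if ever, in training.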
Learning Everything about Anything: Webly-Supervised Visual Concept Learning
TLDR
A fully automated approach for learning extensive models of the wide range of variations within any concept; it leverages vast resources of online books to discover the vocabulary of variance and intertwines the data-collection and modeling steps to alleviate the need for explicit human supervision in training the models.
Learning to generalize to new compositions in image understanding
TLDR
It is argued that structured representations and compositional splits are a useful benchmark for image captioning, and compositional models that capture linguistic and visual structure are advocated.
Attribute-Based Classification for Zero-Shot Visual Object Categorization
We study the problem of object recognition for categories for which we have no training examples, a task also called zero-data or zero-shot learning. This situation has hardly been studied in…
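A standard formalization of this zero-shot setup is direct attribute prediction: score each unseen class by how likely its known attribute signature is under per-attribute probabilities predicted from the image. The sketch below is one plausible reading of that recipe; the shapes, the conditional-independence assumption, and the smoothing constant are illustrative.

```python
import numpy as np

def dap_predict(attr_probs, class_signatures):
    """Direct-attribute-prediction-style zero-shot sketch (illustrative).

    attr_probs:       (n_attrs,) predicted P(attribute present | image)
    class_signatures: (n_classes, n_attrs) binary attribute table for
                      classes with no training images
    """
    # Likelihood of each attribute value under each class signature,
    # assuming attributes are conditionally independent given the class.
    per_attr = np.where(class_signatures == 1, attr_probs, 1.0 - attr_probs)
    log_lik = np.log(per_attr + 1e-12).sum(axis=1)
    return int(log_lik.argmax())  # most compatible unseen class
```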
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
TLDR
A Sentiment Treebank that includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences is introduced; it presents new challenges for sentiment compositionality, which the newly introduced Recursive Neural Tensor Network addresses.
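The Recursive Neural Tensor Network's central step composes two child vectors through a bilinear tensor term plus a standard linear term. A minimal sketch of that single composition follows; the dimension and initialization scale are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

class RNTNCompose(nn.Module):
    """One RNTN composition step: p = tanh(x^T V x + W x), where x is the
    concatenation of the two child vectors. Sizes are illustrative."""

    def __init__(self, dim=25):
        super().__init__()
        self.V = nn.Parameter(torch.randn(dim, 2 * dim, 2 * dim) * 0.01)
        self.W = nn.Linear(2 * dim, dim)

    def forward(self, left, right):
        x = torch.cat([left, right], dim=-1)                 # (2d,)
        bilinear = torch.einsum('a,kab,b->k', x, self.V, x)  # tensor term
        return torch.tanh(bilinear + self.W(x))              # composed parent
```

Applied bottom-up over a parse tree, this yields a vector at every node, which is what enables the fine-grained phrase-level sentiment labels the summary mentions.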
Visual Relationship Detection with Language Priors
TLDR
This work proposes a model that can scale to predict thousands of types of relationships from a few examples; it improves on prior work by leveraging language priors from semantic word embeddings to fine-tune the likelihood of a predicted relationship.
Show and tell: A neural image caption generator
TLDR
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and can be used to generate natural sentences describing an image.