Recurrent Models for Situation Recognition

  • Arun Mallya, Svetlana Lazebnik
  • Published 18 March 2017
  • Computer Science
  • 2017 IEEE International Conference on Computer Vision (ICCV)
This work proposes Recurrent Neural Network (RNN) models to predict structured ‘image situations’ – actions and noun entities fulfilling semantic roles related to the action. In contrast to prior work relying on Conditional Random Fields (CRFs), we use a specialized action prediction network followed by an RNN for noun prediction. Our system obtains state-of-the-art accuracy on the challenging recent imSitu dataset, beating CRF-based models, including ones trained with additional data. Further… 
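The two-stage idea (a dedicated action classifier followed by an RNN that fills the action's semantic roles one noun at a time) can be sketched with toy dimensions. Everything below — vocabulary sizes, the role list, and the random weights — is an illustrative stand-in, not the paper's actual architecture (imSitu has 504 verbs and roughly 11k nouns, and the paper uses trained CNN features and an LSTM):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sizes (hypothetical, far smaller than imSitu's).
n_actions, n_nouns, feat_dim, hid_dim = 4, 6, 8, 5

# Stage 1: a linear action classifier over image features
# (stands in for the specialized action prediction network).
W_act = rng.standard_normal((n_actions, feat_dim))
image_feat = rng.standard_normal(feat_dim)
action = int(np.argmax(softmax(W_act @ image_feat)))

# Stage 2: a vanilla RNN fills the action's semantic roles in sequence,
# conditioning each noun on the image, the action, and the previous noun.
W_xh = rng.standard_normal((hid_dim, feat_dim + n_actions + n_nouns))
W_hh = rng.standard_normal((hid_dim, hid_dim))
W_hy = rng.standard_normal((n_nouns, hid_dim))

action_onehot = np.eye(n_actions)[action]
h = np.zeros(hid_dim)
prev_noun = np.zeros(n_nouns)           # start token
roles = ["agent", "tool", "place"]      # the role set depends on the action
predicted = []
for _ in roles:
    x = np.concatenate([image_feat, action_onehot, prev_noun])
    h = np.tanh(W_xh @ x + W_hh @ h)
    noun = int(np.argmax(softmax(W_hy @ h)))
    predicted.append(noun)
    prev_noun = np.eye(n_nouns)[noun]
```

Feeding the previous noun back in is what lets the RNN capture dependencies between role fillers that a factorized CRF would have to encode in its potentials.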

Figures and Tables from this paper

Citations
Graph neural network for situation recognition
This work proposes a novel mixture-kernel attention graph neural network architecture that enables a dynamic graph structure during training and inference through a graph attention mechanism and context-aware interactions between role pairs, and alleviates semantic sparsity by representing graph kernels as a convex combination of learned bases.
Grounded Situation Recognition with Transformers
The attention mechanism of the model enables accurate verb classification by effectively capturing high-level semantic features of an image, and allows the model to flexibly deal with the complicated, image-dependent relations between entities for improved noun classification and localization.
Mixture-Kernel Graph Attention Network for Situation Recognition
  • M. Suhail, L. Sigal
  • Computer Science
    2019 IEEE/CVF International Conference on Computer Vision (ICCV)
  • 2019
This paper proposes a novel mixture-kernel attention graph neural network (GNN) architecture that enables a dynamic graph structure during training and inference through a graph attention mechanism and context-aware interactions between role pairs.
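The convex-combination idea in the two entries above can be sketched numerically: a small mixing network produces softmax weights over a few basis kernels, yielding a kernel specific to each role pair. All names and shapes here (`bases`, `W_mix`, the feature dimensions) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n_bases, d = 3, 4
bases = rng.standard_normal((n_bases, d, d))    # learned basis kernels
role_i = rng.standard_normal(d)                 # features of one role
role_j = rng.standard_normal(d)                 # features of another role
W_mix = rng.standard_normal((n_bases, 2 * d))   # mixing network (hypothetical)

# Context-aware mixing weights for this role pair; softmax makes the
# combination convex (non-negative, sums to 1).
alpha = softmax(W_mix @ np.concatenate([role_i, role_j]))
kernel = np.tensordot(alpha, bases, axes=1)     # sum_k alpha_k * bases_k

# The pair-specific kernel then scores the edge between the two roles.
edge_score = role_i @ kernel @ role_j
```

Because the weights depend on the role-pair features, different pairs in the same image get different effective kernels, which is what makes the graph structure dynamic.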
Attention-Based Context Aware Reasoning for Situation Recognition
This work proposes the first set of methods to address inter-dependent queries in query-based visual reasoning, and improves upon a state-of-the-art method that answers queries separately.
Grounded Situation Recognition
A Joint Situation Localizer is proposed and it is found that jointly predicting situations and groundings with end-to-end training handily outperforms independent training on the entire grounding metric suite with relative gains between 8% and 32%.
Collaborative Transformers for Grounded Situation Recognition
A novel approach where the two processes for activity classification and entity estimation are interactive and complementary, which achieves the state of the art in all evaluation metrics on the SWiG dataset.
Grounding Semantic Roles in Images
This work renders candidate participants as image regions of objects and trains a model which learns to ground roles in the regions depicting the corresponding participant, inducing frame-semantic visual representations.
Rethinking the Two-Stage Framework for Grounded Situation Recognition
A novel SituFormer for GSR which consists of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM), a transformer-based semantic role detection model that detects all roles in parallel.
Convolutional Image Captioning
This paper develops a convolutional image captioning technique that demonstrates efficacy on the challenging MSCOCO dataset and demonstrates performance on par with the LSTM baseline, while having a faster training time per number of parameters.

References

Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering
This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two
Human action recognition by learning bases of action attributes and parts
This work proposes to use attributes and parts for recognizing human actions in still images by learning a set of sparse bases that are shown to carry much semantic meaning, and shows that this dual sparsity provides theoretical guarantee of the bases learning and feature reconstruction approach.
Long-term recurrent convolutional networks for visual recognition and description
A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Sequence to Sequence Learning with Neural Networks
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
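The reversal trick described above is easy to quantify on a toy aligned sentence pair: reading the source backwards leaves the total source-to-target distance unchanged, but gives the first few word pairs very short time lags, which is the short-term-dependency effect the paper credits for easier optimization:

```python
# Distance in encoder/decoder time steps between source word i and its
# aligned target word, for a length-n pair read forward vs reversed.
src = ["a", "b", "c", "d"]   # hypothetical monotone-aligned pair
tgt = ["A", "B", "C", "D"]
n = len(src)

# Forward reading: src[i] is read at step i+1, tgt[i] is emitted at
# step n+i+1, so every pair is exactly n steps apart.
fwd_dist = [(n + i + 1) - (i + 1) for i in range(n)]

# Reversed reading: src[i] is read at step n-i, so src[0] sits right
# next to tgt[0] and the lags grow as 1, 3, 5, ...
rev_dist = [(n + i + 1) - (n - i) for i in range(n)]
```

The average lag is identical in both cases; only its distribution changes, front-loading the easy dependencies.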
Recurrent Neural Network Regularization
This paper shows how to correctly apply dropout to LSTMs, and shows that it substantially reduces overfitting on a variety of tasks.
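The paper's prescription — apply dropout only to non-recurrent connections so the memory path is never corrupted — can be sketched with a vanilla RNN cell standing in for the LSTM. The `dropout` helper is the standard inverted-dropout mask, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(x, p, rng):
    # Inverted dropout: zero units with probability p, rescale survivors
    # so the expected activation is unchanged.
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

hid, inp, p = 5, 3, 0.5
W_xh = rng.standard_normal((hid, inp))
W_hh = rng.standard_normal((hid, hid))

h = np.zeros(hid)
for t in range(4):
    x = rng.standard_normal(inp)
    # Dropout only on the input (non-recurrent) connection; the
    # hidden-to-hidden path h -> h stays intact, so information can
    # survive arbitrarily many steps without being repeatedly masked.
    h = np.tanh(W_xh @ dropout(x, p, rng) + W_hh @ h)
```

Masking the recurrent path instead would multiply the memory by a fresh dropout mask at every step, destroying long-range information — which is exactly the failure mode the paper's scheme avoids.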
Language Models for Image Captioning: The Quirks and What Works
By combining key aspects of the ME and RNN methods, this paper achieves a new record performance over previously published results on the benchmark COCO dataset; however, the gains the authors see in BLEU do not translate to human judgments.
Structured Attention Networks
This work shows that structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees.
Boosting Image Captioning with Attributes
This paper presents Long Short-Term Memory with Attributes (LSTM-A) - a novel architecture that integrates attributes into the successful Convolutional Neural Networks plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner.
Guiding the Long-Short Term Memory Model for Image Caption Generation
In this work we focus on the problem of image caption generation. We propose an extension of the long short term memory (LSTM) model, which we coin gLSTM for short. In particular, we add semantic
Show and tell: A neural image caption generator
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
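The encoder-decoder recipe behind this generator can be sketched with greedy decoding: an image feature initializes the recurrent state, and the argmax word is fed back until an end token. The vocabulary, weights, and the vanilla RNN cell (in place of the paper's LSTM) are all toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)

vocab = ["<start>", "<end>", "a", "dog", "runs"]   # toy vocabulary
V, feat_dim, hid = len(vocab), 6, 5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# A fixed random vector stands in for the CNN encoder's image feature;
# all weight matrices are random stand-ins for trained parameters.
image_feat = rng.standard_normal(feat_dim)
W_init = rng.standard_normal((hid, feat_dim))   # image -> initial state
W_emb = rng.standard_normal((V, hid))           # word embeddings
W_hh = rng.standard_normal((hid, hid))
W_out = rng.standard_normal((V, hid))

# Greedy decoding: condition on the image once, then emit the argmax
# word at each step until <end> or a hard length cap.
h = np.tanh(W_init @ image_feat)
word = vocab.index("<start>")
caption = []
for _ in range(10):
    h = np.tanh(W_emb[word] + W_hh @ h)
    word = int(np.argmax(softmax(W_out @ h)))
    if vocab[word] == "<end>":
        break
    caption.append(vocab[word])
```

The paper itself uses beam search rather than pure greedy decoding; greedy is shown here only because it keeps the control flow to a single loop.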