Recurrent Models for Situation Recognition
@inproceedings{Mallya2017RecurrentMF, title={Recurrent Models for Situation Recognition}, author={Arun Mallya and Svetlana Lazebnik}, booktitle={2017 IEEE International Conference on Computer Vision (ICCV)}, year={2017}, pages={455--463} }
This work proposes Recurrent Neural Network (RNN) models to predict structured ‘image situations’ – actions and noun entities fulfilling semantic roles related to the action. In contrast to prior work relying on Conditional Random Fields (CRFs), we use a specialized action prediction network followed by an RNN for noun prediction. Our system obtains state-of-the-art accuracy on the challenging recent imSitu dataset, beating CRF-based models, including ones trained with additional data. Further…
21 Citations
Graph neural network for situation recognition
- Computer Science
- 2019
This work proposes a novel mixture-kernel attention graph neural network architecture that enables dynamic graph structure during training and inference, through the use of a graph attention mechanism, and context-aware interactions between role pairs, and alleviates semantic sparsity by representing graph kernels using a convex combination of learned bases.
Grounded Situation Recognition with Transformers
- Computer Science
- BMVC
- 2021
The attention mechanism of the model enables accurate verb classification by effectively capturing high-level semantic features of an image, and allows the model to flexibly deal with the complicated and image-dependent relations between entities for improved noun classification and localization.
Mixture-Kernel Graph Attention Network for Situation Recognition
- Computer Science
- 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This paper proposes a novel mixture-kernel attention graph neural network (GNN) architecture that enables dynamic graph structure during training and inference, through the use of a graph attention mechanism, and context-aware interactions between role pairs.
Attention-Based Context Aware Reasoning for Situation Recognition
- Computer Science
- 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This work proposes the first set of methods to address inter-dependent queries in query-based visual reasoning, and improves upon a state-of-the-art method that answers queries separately.
Grounded Situation Recognition
- Computer Science
- ECCV
- 2020
A Joint Situation Localizer is proposed and it is found that jointly predicting situations and groundings with end-to-end training handily outperforms independent training on the entire grounding metric suite with relative gains between 8% and 32%.
Collaborative Transformers for Grounded Situation Recognition
- Computer Science
- ArXiv
- 2022
A novel approach where the two processes for activity classification and entity estimation are interactive and complementary, which achieves the state of the art in all evaluation metrics on the SWiG dataset.
Grounding Semantic Roles in Images
- Computer Science, Psychology
- EMNLP
- 2018
This work renders candidate participants as image regions of objects, and trains a model which learns to ground roles in the regions which depict the corresponding participant, and induces frame-semantic visual representations.
Rethinking the Two-Stage Framework for Grounded Situation Recognition
- Computer Science
- ArXiv
- 2021
A novel SituFormer for GSR, consisting of a Coarse-to-Fine Verb Model (CFVM) and a Transformer-based Noun Model (TNM); the TNM is a transformer-based semantic role detection model that detects all roles in parallel.
Convolutional Image Captioning
- Computer Science
- 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
This paper develops a convolutional image captioning technique that shows efficacy on the challenging MSCOCO dataset and performs on par with the LSTM baseline, while having a faster training time per number of parameters.
References
Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering
- Computer Science
- ECCV
- 2016
This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two…
Human action recognition by learning bases of action attributes and parts
- Computer Science
- 2011 International Conference on Computer Vision
- 2011
This work proposes to use attributes and parts for recognizing human actions in still images by learning a set of sparse bases that are shown to carry much semantic meaning, and shows that this dual sparsity provides a theoretical guarantee for the bases learning and feature reconstruction approach.
Long-term recurrent convolutional networks for visual recognition and description
- Computer Science
- 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2015
A novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and shows such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
Sequence to Sequence Learning with Neural Networks
- Computer Science
- NIPS
- 2014
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
Recurrent Neural Network Regularization
- Computer Science
- ArXiv
- 2014
This paper shows how to correctly apply dropout to LSTMs, and shows that it substantially reduces overfitting on a variety of tasks.
Language Models for Image Captioning: The Quirks and What Works
- Computer Science
- ACL
- 2015
By combining key aspects of the ME and RNN methods, this paper achieves a new record performance over previously published results on the benchmark COCO dataset; however, the gains the authors see in BLEU do not translate to human judgments.
Structured Attention Networks
- Computer Science
- ICLR
- 2017
This work shows that structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees.
Boosting Image Captioning with Attributes
- Computer Science
- 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
This paper presents Long Short-Term Memory with Attributes (LSTM-A) - a novel architecture that integrates attributes into the successful Convolutional Neural Networks plus Recurrent Neural Networks (RNNs) image captioning framework, by training them in an end-to-end manner.
Guiding the Long-Short Term Memory Model for Image Caption Generation
- Computer Science
- 2015 IEEE International Conference on Computer Vision (ICCV)
- 2015
In this work we focus on the problem of image caption generation. We propose an extension of the long short term memory (LSTM) model, which we coin gLSTM for short. In particular, we add semantic…
Show and tell: A neural image caption generator
- Computer Science
- 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2015
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.