Corpus ID: 216080825

Visual Commonsense Graphs: Reasoning about the Dynamic Context of a Still Image

@article{Park2020VisualCG,
  title={Visual Commonsense Graphs: Reasoning about the Dynamic Context of a Still Image},
  author={Jae Sung Park and Chandra Bhagavatula and Roozbeh Mottaghi and Ali Farhadi and Yejin Choi},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.10796}
}
Even from a single frame of a still image, people can reason about the dynamic story of the image before, after, and beyond the frame. For example, given an image of a man struggling to stay afloat in water, we can reason that the man fell into the water sometime in the past, that his intent at the moment is to stay alive, and that he will need help in the near future or else he will be washed away. We propose VisualComet, a novel framework of visual commonsense reasoning tasks to predict… 
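To make the task framing above concrete, the sketch below shows one plausible way to represent a single visual commonsense inference record as a Python data structure. The field names (image_id, event, place, before, intent, after) are illustrative assumptions based on the abstract's description of before/intent/after inferences, not the dataset's actual annotation schema.

from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: field names are assumptions based on the abstract's
# description (events before, intents at present, events after), not the
# actual VisualComet schema.
@dataclass
class VisualCommonsenseRecord:
    image_id: str                                       # identifier of the still image
    event: str                                          # what is happening at the moment
    place: str                                          # scene / location description
    before: List[str] = field(default_factory=list)     # events that likely happened before
    intent: List[str] = field(default_factory=list)     # what the person intends right now
    after: List[str] = field(default_factory=list)      # events likely to happen next

# Example drawn from the abstract's running example of a man struggling in water.
example = VisualCommonsenseRecord(
    image_id="example_0001",
    event="a man is struggling to stay afloat in the water",
    place="open water",
    before=["the man fell into the water"],
    intent=["stay alive"],
    after=["the man gets help", "the man is washed away if no one helps"],
)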
Two Heads are Better Than One: Hypergraph-Enhanced Graph Reasoning for Visual Event Ratiocination
TLDR
A novel multimodal model that represents the contents from the same modality as a semantic graph and mines the intra-modality relationships, thereby breaking the limitations of the spatial domain; the model illustrates the case of "two heads are better than one" in that semantic graph representations aided by the proposed enhancement mechanism are more robust than those without it.
Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs
TLDR
This study presents Rationale^VT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs; it finds that integrating richer semantic and pragmatic visual features improves the visual fidelity of rationales.
What Is More Likely to Happen Next? Video-and-Language Future Event Prediction
TLDR
This work collects a new dataset, named Video-and-Language Event Prediction (VLEP), with 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips, and presents a strong baseline incorporating information from video, dialogue, and commonsense knowledge.

References

Showing 1-10 of 55 references
From Recognition to Cognition: Visual Commonsense Reasoning
TLDR
To move towards cognition-level understanding, a new reasoning engine, Recognition to Cognition Networks (R2C), is presented that models the necessary layered inferences for grounding, contextualization, and reasoning.
Learning Common Sense through Visual Abstraction
TLDR
The use of human-generated abstract scenes made from clipart for learning common sense is explored, and the commonsense knowledge learned is shown to be complementary to what can be learned from textual sources.
Visual Dialog
TLDR
A retrieval-based evaluation protocol for Visual Dialog, in which the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as the mean reciprocal rank of the human response, together with a family of neural encoder-decoder models that outperform a number of sophisticated baselines.
Inferring the Why in Images
TLDR
Recently developed natural language models are used to mine knowledge stored in massive amounts of text, and the results suggest that transferring knowledge from language into vision can help machines understand why a person might be performing an action in an image.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge
TLDR
This paper addresses the task of knowledge-based visual question answering and provides a benchmark, called OK-VQA, where the image content is not sufficient to answer the questions, encouraging methods that rely on external knowledge resources.
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
TLDR
This paper presents Flickr30K Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes.
Show and tell: A neural image caption generator
TLDR
This paper presents a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image.
ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning
TLDR
Experimental results demonstrate that multitask models that incorporate the hierarchical structure of if-then relation types lead to more accurate inference compared to models trained in isolation, as measured by both automatic and human evaluation.
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
TLDR
A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.