• Corpus ID: 16069217

Answering Image Riddles using Vision and Reasoning through Probabilistic Soft Logic

Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos
In this work, we explore a genre of puzzles ("image riddles") which involves a set of images and a question. Answering these puzzles requires capabilities in both visual detection (including object and activity recognition) and knowledge-based or commonsense reasoning. We compile a dataset of over 3k riddles where each riddle consists of 4 images and a ground-truth answer. The annotations are validated using crowd-sourced evaluation. We also define an automatic evaluation metric to track…
Explainable Image Understanding Using Vision and Reasoning
This work aims to efficiently use (probabilistic) commonsense and background knowledge, encoded in knowledge bases, on top of machine learning capabilities to rectify noise in visual detections and to generate captions or answers to posed questions.
Image Understanding using vision and reasoning through Scene Description Graph
A general architecture is proposed in which a system can represent both the content and underlying concepts of an image using an SDG, and it is proposed that the extracted graphs capture syntactic and semantic content of images with reasonable accuracy.
How a General-Purpose Commonsense Ontology can Improve Performance of Learning-Based Image Retrieval
This work studies how general-purpose ontologies, specifically MIT's ConceptNet ontology, can improve the performance of state-of-the-art vision systems and shows that ConceptNet can improve performance on a common benchmark dataset.
Robot Sequential Decision Making using LSTM-based Learning and Logical-probabilistic Reasoning
This work aims at a robot sequential decision-making (SDM) framework that exploits the complementary features of learning, reasoning, and planning. It uses long short-term memory (LSTM) for passive state estimation with streaming sensor data, and commonsense reasoning and probabilistic planning for active information collection and task accomplishment.


Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language
What Value Do Explicit High Level Concepts Have in Vision to Language Problems?
A method of incorporating high-level concepts into the successful CNN-RNN approach is proposed, and it is shown that it achieves a significant improvement on the state-of-the-art in both image captioning and visual question answering.
Microsoft COCO: Common Objects in Context
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene
Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question
This work presents the mQA model, which answers questions about the content of an image. It contains four components: a Long Short-Term Memory (LSTM), a Convolutional Neural Network (CNN), an LSTM for storing the linguistic context in an answer, and a fusing component that combines the information from the first three components and generates the answer.
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%.
Learning Hierarchical Features for Scene Labeling
This method uses a multiscale convolutional network trained from raw pixels to extract dense feature vectors that encode regions of multiple sizes centered on each pixel. It alleviates the need for engineered features and produces a powerful representation that captures texture, shape, and contextual information.
Learning to Answer Questions from Image Using Convolutional Neural Network
The proposed CNN provides an end-to-end framework with convolutional architectures for learning not only the image and question representations, but also their inter-modal interactions to produce the answer.
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
Collective Activity Detection Using Hinge-loss Markov Random Fields
This work applies hinge-loss Markov random fields (HL-MRFs), a powerful class of continuous-valued graphical models, to the task of activity detection using principles of collective classification, and shows that this approach achieves significant lift over the low-level detectors.
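
For readers unfamiliar with the hinge-loss potentials that HL-MRFs and Probabilistic Soft Logic build on, the following is a minimal sketch of how a single weighted rule of the form body → head can be scored over soft truth values in [0, 1]. It uses the standard Lukasiewicz relaxation of conjunction; the function names, the example truth values, and the rule weight are hypothetical and chosen only for illustration. It is not taken from any of the papers listed here.

```python
def lukasiewicz_and(*truth_values):
    """Lukasiewicz conjunction of soft truth values in [0, 1]."""
    return max(0.0, sum(truth_values) - (len(truth_values) - 1))


def distance_to_satisfaction(body, head):
    """How far the rule body -> head is from being satisfied.

    The rule is fully satisfied (distance 0) when the head is at
    least as true as the conjunction of the body atoms.
    """
    return max(0.0, lukasiewicz_and(*body) - head)


def hinge_potential(weight, body, head, squared=False):
    """Weighted hinge-loss potential for one ground rule.

    `squared` selects between the linear and squared hinge variants.
    """
    d = distance_to_satisfaction(body, head)
    return weight * (d ** 2 if squared else d)


# Hypothetical ground rule with two body atoms (0.9 and 0.8),
# a head atom (0.3), and an assumed rule weight of 2.0.
penalty = hinge_potential(2.0, body=[0.9, 0.8], head=0.3)
print(round(penalty, 6))
```

Inference in such a model minimizes the sum of these potentials over all ground rules. Each potential is convex in the truth values, so the minimization remains tractable with continuous variables.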