Corpus ID: 3158329

A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input

@inproceedings{Malinowski2014AMA,
  title={A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input},
  author={Mateusz Malinowski and Mario Fritz},
  booktitle={NIPS},
  year={2014}
}
We propose a method for automatically answering questions about images by bringing together recent advances from natural language processing and computer vision. We combine discrete reasoning with uncertain predictions by a multi-world approach that represents uncertainty about the perceived world in a Bayesian framework. Our approach can handle human questions of high complexity about realistic scenes and replies with a range of answers such as counts, object classes, instances, and lists of them…
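As a minimal sketch of the multi-world idea described in the abstract (an illustration under assumptions, not the authors' implementation): a question is parsed into a logical query, several discrete scene interpretations ("worlds") are sampled from an uncertain perception model, the query is executed deterministically in each world, and the answers are aggregated into a distribution. The names parse_question, execute_query, and perception_posterior below are hypothetical placeholders.

# Hedged sketch of multi-world question answering (assumed helpers, not the paper's code).
from collections import Counter

def answer_distribution(question, image, perception_posterior, n_worlds=50):
    query = parse_question(question)                # question -> logical form (hypothetical)
    votes = Counter()
    for _ in range(n_worlds):
        world = perception_posterior.sample(image)  # one discrete scene hypothesis (hypothetical)
        votes[execute_query(query, world)] += 1     # deterministic reasoning within a single world
    total = sum(votes.values())
    # Approximates P(answer | question, image) by vote frequency over sampled worlds.
    return {a: c / total for a, c in votes.items()}

Citations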
Explicit Knowledge-based Reasoning for Visual Question Answering
A method for visual question answering which is capable of reasoning about the contents of an image on the basis of information extracted from a large-scale knowledge base is described, addressing one of the key issues in general visual question answering.
A Bayesian approach to Visual Question Answering
Visual question answering (VQA) is a complex task involving perception and reasoning. Typical approaches, which use black-box neural networks, do well on perception but fail to generalize due to lack…
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose…
Object-Based Reasoning in VQA
A solution combining state-of-the-art object detection and reasoning modules for VQA, evaluated on the well-balanced CLEVR dataset, which shows significant improvements of a few percent in accuracy on the complex "counting" task.
Answer-Type Prediction for Visual Question Answering
This paper builds a system capable of answering open-ended text-based questions about images, known as Visual Question Answering (VQA), which can predict the form of the answer from the question in a Bayesian framework.
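As a hedged illustration of this answer-type idea (an assumed general scheme, not the paper's actual model): the answer score can be obtained by marginalising over predicted answer types. The functions p_type_given_question and p_answer_given_type below are hypothetical model outputs.

# Hedged sketch: marginalise candidate answers over predicted answer types
# (e.g. yes/no, number, object class). All helpers passed in are hypothetical.
def predict_answer(question, image, answer_types, candidate_answers,
                   p_type_given_question, p_answer_given_type):
    scores = {}
    for a in candidate_answers:
        scores[a] = sum(
            p_answer_given_type(a, t, question, image) * p_type_given_question(t, question)
            for t in answer_types
        )
    return max(scores, key=scores.get)  # highest marginal score wins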
Survey of Recent Advances in Visual Question Answering
The paper describes the approaches taken by various algorithms to extract image features and text features, how these are employed to predict answers, and briefly discusses the experiments performed to evaluate VQA models.
Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions
This paper proposes a hybrid model which integrates a physics engine into a question answering architecture in order to anticipate future scene states resulting from object-object interactions caused by an action.
Semantic Parsing to Probabilistic Programs for Situated Question Answering
P3 is presented, a novel situated question answering model that treats semantic parses as probabilistic programs which execute nondeterministically and whose possible executions represent environmental uncertainty; it can use background knowledge and global features of the question/environment interpretation while retaining efficient approximate inference.
Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering
This work develops an entity graph and uses a graph convolutional network to 'reason' about the correct answer by jointly considering all entities, and shows that this leads to an improvement in accuracy of around 7% compared to the state of the art.
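As a rough, hedged sketch of the kind of graph-convolution step the summary alludes to (the normalisation and ReLU are assumptions, not the paper's exact architecture): each entity node aggregates its neighbours' features through a shared weight matrix.

# Minimal graph-convolution layer over an entity graph (illustrative sketch only).
import numpy as np

def gcn_layer(adjacency, node_features, weights):
    a_hat = adjacency + np.eye(adjacency.shape[0])            # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt                  # symmetric normalisation
    return np.maximum(a_norm @ node_features @ weights, 0.0)  # aggregate, project, ReLU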
Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering
A learning-based approach which goes straight to the facts via a learned embedding space is developed, and state-of-the-art results are demonstrated on the challenging, recently introduced fact-based visual question answering dataset.
...

References

Showing 1-10 of 30 references
What Are You Talking About? Text-to-Image Coreference
This paper proposes a structured prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects and the scene type, as well as to align the nouns/pronouns with the referred visual objects.
A Joint Model of Language and Perception for Grounded Attribute Learning
This work presents an approach for joint learning of language and perception models for grounded attribute induction, which includes a language model based on a probabilistic categorial grammar that enables the construction of compositional meaning representations.
Learning Dependency-Based Compositional Semantics
A new semantic formalism, dependency-based compositional semantics (DCS), is developed, and a log-linear distribution over DCS logical forms is defined; the system is shown to obtain accuracies comparable to even state-of-the-art systems that do require annotated logical forms.
Jointly Learning to Parse and Perceive: Connecting Natural Language to the Physical World
This paper introduces Logical Semantics with Perception (LSP), a model for grounded language acquisition that learns to map natural language statements to their referents in a physical environment, and finds that LSP outperforms existing, less expressive models that cannot represent relational language.
YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition
This paper presents a solution that takes a short video clip and outputs a brief sentence that sums up the main activity in the video, such as the actor, the action and its object, and uses a Web-scale language model to "fill in" novel verbs.
Deep Fragment Embeddings for Bidirectional Image Sentence Mapping
This work introduces a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data, and introduces a structured max-margin objective that allows the model to explicitly associate fragments across modalities.
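A hedged sketch of the kind of fragment-level max-margin objective the summary refers to (the shapes and the +1/-1 match matrix are assumptions, not the paper's exact formulation):

# Pairwise hinge loss over image/sentence fragment similarities (illustrative only).
import numpy as np

def fragment_margin_loss(image_fragments, sentence_fragments, match, margin=1.0):
    # match[i, j] = +1 if image fragment i and sentence fragment j belong together, else -1
    scores = image_fragments @ sentence_fragments.T   # fragment-to-fragment similarities
    return np.maximum(0.0, margin - match * scores).sum()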
Grounding spatial relations for human-robot interaction
A system for human-robot interaction that learns both models for spatial prepositions and for object recognition, and grounds the meaning of an input sentence in terms of visual percepts coming from the robot's sensors in order to send an appropriate command to the PR2 or respond to spatial queries.
Learning to Follow Navigational Directions
A system that learns to follow natural language navigational directions by apprenticeship from routes through a map paired with English descriptions, using a reinforcement learning algorithm that grounds the meaning of spatial terms like above and south in geometric properties of paths.
Understanding Natural Language Commands for Robotic Navigation and Mobile Manipulation
This paper describes a new model for understanding natural language commands given to autonomous systems that perform navigation and mobile manipulation in semi-structured environments; the model dynamically instantiates a probabilistic graphical model for a particular natural language command according to the command's hierarchical and compositional semantic structure.
Image Retrieval with Structured Object Queries Using Latent Ranking SVM
This work develops a learning framework to jointly consider object classes and their relations in image retrieval with structured object queries: queries that specify the objects that should be present in the scene and their spatial relations.
...