IQA: Visual Question Answering in Interactive Environments

@inproceedings{Gordon2018IQAVQ,
  title={IQA: Visual Question Answering in Interactive Environments},
  author={Daniel Gordon and Aniruddha Kembhavi and Mohammad Rastegari and Joseph Redmon and Dieter Fox and Ali Farhadi},
  booktitle={2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2018},
  pages={4089--4098}
}
We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. [...] We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction.
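To make the idea of factorized controllers operating at multiple levels of temporal abstraction concrete, here is a minimal, self-contained sketch of a hierarchical agent loop: a high-level planner repeatedly delegates to low-level subtask controllers, each of which runs several primitive actions before handing control back. All names (DummyEnv, Planner, LowLevelController) and the random placeholder policies are illustrative assumptions, not the paper's HIMN implementation, which uses learned policies and a semantic spatial memory.

```python
import random

class DummyEnv:
    """Toy stand-in for an interactive environment (hypothetical; not AI2-THOR)."""
    def __init__(self, episode_length=20):
        self.t = 0
        self.episode_length = episode_length

    def reset(self):
        self.t = 0
        return {"question": "Is there a mug in the room?"}

    def step(self, action):
        # Returns (observation, done); a real environment would also update its state.
        self.t += 1
        return {"question": "Is there a mug in the room?"}, self.t >= self.episode_length

class LowLevelController:
    """Runs one subtask (e.g. navigate, interact, scan) for a few primitive steps,
    so the planner above it operates on a coarser time scale."""
    def __init__(self, name, actions, horizon=5):
        self.name = name
        self.actions = actions
        self.horizon = horizon

    def run(self, obs, env):
        done = False
        for _ in range(self.horizon):
            action = random.choice(self.actions)  # stand-in for a learned policy
            obs, done = env.step(action)
            if done:
                break
        return obs, done

class Planner:
    """High-level controller that picks which subtask controller to invoke next;
    a learned version would condition on the question and an episodic memory."""
    def __init__(self, controllers):
        self.controllers = controllers

    def choose(self, obs):
        return random.choice(self.controllers)  # placeholder for a learned policy

def run_episode(env, planner, max_subtasks=10):
    obs = env.reset()
    for _ in range(max_subtasks):
        controller = planner.choose(obs)
        obs, done = controller.run(obs, env)
        if done:
            break
    return "unknown"  # an answering module would decode the final answer here

if __name__ == "__main__":
    controllers = [
        LowLevelController("navigate", ["move_ahead", "rotate_left", "rotate_right"]),
        LowLevelController("interact", ["open", "close", "pick_up"]),
        LowLevelController("scan", ["look_up", "look_down"]),
    ]
    print(run_episode(DummyEnv(), Planner(controllers)))
```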

Citations

VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering
TLDR
The VideoNavQA dataset, containing pairs of questions and videos generated in the House3D environment, is introduced to establish an initial understanding of how well VQA-style methods can perform within this novel EQA paradigm.
Multi-agent Embodied Question Answering in Interactive Environments
TLDR
This work investigates a new AI task, Multi-Agent Interactive Question Answering, in which several agents jointly explore an interactive environment to answer a question, and proposes a question answering model built upon a 3D-CNN to encode the scene memories.
Embodied Question Answering
TLDR
A new AI task where an agent is spawned at a random location in a 3D environment and asked a question ('What color is the car?'), and the agent must first intelligently navigate to explore the environment, gather necessary visual information through first-person (egocentric) vision, and then answer the question.
Visual Experience-Based Question Answering with Complex Multimodal Environments
TLDR
A hybrid visual question answering system, VQAS, is proposed that integrates a deep neural network-based scene graph generation model with a rule-based knowledge reasoning system, generating more accurate scene graphs for dynamic environments with some uncertainty.
Interactive Language Learning by Question Answering
TLDR
This work proposes and evaluates a set of baseline models for the QAit task, including deep reinforcement learning agents, and shows that the task presents a major challenge for machine reading systems, while humans solve it with relative ease.
Agent with the Big Picture: Perceiving Surroundings for Interactive Instruction Following
We address the interactive instruction following task [4, 9, 8], which requires an agent to navigate through an environment, interact with objects, and complete long-horizon tasks, following natural language instructions.
Revisiting EmbodiedQA: A Simple Baseline and Beyond
TLDR
A simple yet effective baseline that achieves promising performance; an easier and practical setting for EmbodiedQA where an agent has a chance to adapt the trained model to a new environment before it actually answers users' questions; and a small change in settings that yields a notable gain in navigation.
Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition
TLDR
This work develops an agent empowered with visual curiosity, i.e., the ability to ask questions to an oracle and build a visual recognition model based on the answers received, and proposes a novel framework that formulates the learning of visual curiosity as a reinforcement learning problem.
BERT Representations for Video Question Answering
TLDR
This work proposes to use BERT, a sequential modelling technique based on Transformers, to encode the complex semantics of video clips, capturing the visual and language information of a video scene by encoding not only the subtitles but also a sequence of visual concepts with a pretrained language-based Transformer.
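As a rough illustration of the encoding scheme described in this entry, the sketch below feeds subtitle text and a sequence of visual concept labels into a pretrained BERT model as two segments, using the Hugging Face transformers library. The checkpoint, example content, and pairing strategy are assumptions for illustration, not the cited paper's exact pipeline.

```python
import torch
from transformers import BertModel, BertTokenizer

# Pretrained BERT as the shared encoder (illustrative choice of checkpoint).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Segment A: subtitles of the clip; segment B: detected visual concepts,
# serialised as words (hypothetical example content).
subtitles = "I think the keys are on the kitchen table."
visual_concepts = "person kitchen table keys cup window"

# Encode the pair as "[CLS] subtitles [SEP] concepts [SEP]".
inputs = tokenizer(subtitles, visual_concepts, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# [CLS]-position representation of the whole scene; a QA head would be trained
# on top of this (or on the token-level states) to score candidate answers.
scene_embedding = outputs.last_hidden_state[:, 0, :]
print(scene_embedding.shape)  # torch.Size([1, 768])
```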

References

SHOWING 1-10 OF 116 REFERENCES
Visual7W: Grounded Question Answering in Images
TLDR
A semantic link between textual descriptions and image regions is established by object-level grounding, enabling a new type of QA with visual answers in addition to the textual answers used in previous work, and a novel LSTM model with spatial attention is proposed to tackle the 7W QA tasks.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
TLDR
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with its new tasks.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering
TLDR
The Spatial Memory Network, a novel spatial attention architecture that aligns words with image patches in the first hop, is proposed, and improved results are obtained compared to a strong deep baseline model which concatenates image and question features to predict the answer.
Target-driven visual navigation in indoor scenes using deep reinforcement learning
TLDR
This paper proposes an actor-critic model whose policy is a function of the goal as well as the current state, which allows better generalization, and introduces the AI2-THOR framework, which provides an environment with high-quality 3D scenes and a physics engine.
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining the latest advances in image representation and natural language processing, we propose [...]
Visual question answering: A survey of methods and datasets
TLDR
The state of the art is examined by comparing modern approaches to VQA, including the common approach of combining convolutional and recurrent neural networks to map images and questions to a common feature space.
DeepStory: Video Story QA by Deep Embedded Memory Networks
TLDR
A video-story learning model, Deep Embedded Memory Networks (DEMN), is proposed to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data, outperforming other QA models.
Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources
TLDR
A method for visual question answering is presented which combines an internal representation of the content of an image with information extracted from a general knowledge base to answer a broad range of image-based questions, and is specifically able to answer questions posed in natural language that refer to information not contained in the image.
Hierarchical Question-Image Co-Attention for Visual Question Answering
TLDR
This paper presents a novel co-attention model for VQA that jointly reasons about image and question attention in a hierarchical fashion via novel 1-dimensional convolutional neural networks (CNNs).