Visual Dialog

@article{Das2017VisualD,
  title={Visual Dialog},
  author={Abhishek Das and Satwik Kottur and Khushi Gupta and Avi Singh and Deshraj Yadav and Jos{\'e} M. F. Moura and Devi Parikh and Dhruv Batra},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017},
  pages={1080-1089}
}
We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being…
Dual Attention Networks for Visual Reference Resolution in Visual Dialog
TLDR
This paper proposes Dual Attention Networks (DAN) for visual reference resolution, a model that consists of two kinds of attention networks, REFER and FIND, which outperforms the previous state-of-the-art model by a significant margin.
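A minimal sketch, under my own shapes and naming (not the authors' code), of the two attention stages the TLDR names: a REFER module attends over dialog-history embeddings to resolve references in the question, and a FIND module attends over image regions using the reference-aware question.

import torch
import torch.nn as nn

class ReferFind(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.refer = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.find = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, question, history, regions):
        # question: (B, 1, D) pooled question embedding
        # history:  (B, T, D) one embedding per past dialog round
        # regions:  (B, R, D) image-region features
        ref, _ = self.refer(question, history, history)   # resolve references ("it", "the one on the left")
        q_aware = question + ref                          # reference-aware question
        ctx, attn = self.find(q_aware, regions, regions)  # ground the refined question in the image
        return ctx.squeeze(1), attn

model = ReferFind()
ctx, attn = model(torch.randn(2, 1, 512), torch.randn(2, 5, 512), torch.randn(2, 36, 512))
print(ctx.shape, attn.shape)  # torch.Size([2, 512]) torch.Size([2, 1, 36])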
CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog
TLDR
This work develops CLEVR-Dialog, a large diagnostic dataset for studying multi-round reasoning in visual dialog, and constructs a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset, resulting in a dataset where all aspects of the visual dialog are fully annotated.
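A toy illustration of the construction principle, with templates of my own invention rather than the actual CLEVR-Dialog grammar: because every question and answer is derived from an annotated scene graph, the resulting dialog is fully annotated by construction.

import random

scene_graph = [
    {"shape": "cube", "color": "red", "size": "large"},
    {"shape": "sphere", "color": "blue", "size": "small"},
]

def dialog_round(graph):
    obj = random.choice(graph)  # pick an object from the annotated scene
    question = f"What color is the {obj['size']} {obj['shape']}?"
    return question, obj["color"]  # the answer comes straight from the graph

for _ in range(2):
    q, a = dialog_round(scene_graph)
    print(q, "->", a)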
Cognitive Attention Network (CAN) for Text and Image Multimodal Visual Dialog Systems
Visual question answering and visual dialog systems are the emerging research areas in natural language processing that exploit the use of image and text modalities to convey an understanding of the…
Transfer learning for multimodal dialog
TLDR
This paper transfers the algorithmic approach, models, and data from a background corpus of 2,000 hours of how-to videos to the AVSD task, and reports the findings.
Visual Dialog with Multi-turn Attentional Memory Network
TLDR
An attentional memory network is proposed that maintains image regions and historical dialog in two memory banks and attends the question to be answered to both the visual and textual banks to obtain multi-modal facts.
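A rough sketch, under assumed shapes and my own naming, of the two-bank attention this TLDR describes: the question attends separately over a visual memory bank of image regions and a textual memory bank of past dialog rounds, and the two readouts are fused into a multi-modal fact.

import torch
import torch.nn.functional as F

def attend(query, bank):
    # query: (B, D), bank: (B, N, D) -> attended readout (B, D)
    scores = torch.einsum("bd,bnd->bn", query, bank) / bank.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)
    return torch.einsum("bn,bnd->bd", weights, bank)

def multimodal_fact(question, visual_bank, text_bank):
    v = attend(question, visual_bank)  # read from image-region memory
    t = attend(question, text_bank)    # read from dialog-history memory
    return torch.cat([v, t], dim=-1)   # fused multi-modal fact

q = torch.randn(2, 256)
fact = multimodal_fact(q, torch.randn(2, 36, 256), torch.randn(2, 10, 256))
print(fact.shape)  # torch.Size([2, 512])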
Saying the Unseen: Video Descriptions via Dialog Agents
  • Ye Zhu, Yu Wu, Yi Yang, Yan Yan
  • Computer Science, Medicine
  • IEEE transactions on pattern analysis and machine intelligence
  • 2021
TLDR
A novel task is introduced that aims to describe a video using the natural language dialog between two agents as a supplementary information source given incomplete visual data, together with the proposed QA-Cooperative networks.
SeqDialN: Sequential Visual Dialog Network in Joint Visual-Linguistic Representation Space
TLDR
This work formulates visual dialog as an information flow in which each piece of information is encoded with the joint visual-linguistic representation of a single dialog round, and proposes two Sequential Dialog Networks (SeqDialN) for inference and featurization.
Making History Matter: History-Advantage Sequence Training for Visual Dialog
TLDR
This work intentionally imposes wrong answers in the dialog history to obtain an adverse critic, and observes how the historical error impacts the model's future behavior via History Advantage, a quantity obtained by subtracting the adverse critic from the gold reward of the ground-truth history.
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
TLDR
This work poses a cooperative ‘image guessing’ game between two agents who communicate in natural language dialog so that Q-BOT can select an unseen image from a lineup of images and shows the emergence of grounded language and communication among ‘visual’ dialog agents with no human supervision.
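A toy REINFORCE update, as an assumption-laden illustration of how a Q-BOT/A-BOT pair could be trained with policy gradients on an image-guessing reward; the actual game, agent architectures, and reward design in the paper are far richer than this.

import torch

def reinforce_loss(log_probs, reward):
    # log_probs: (B, T) log-probability of each generated dialog token
    # reward:    (B,)   e.g. how much Q-BOT's image guess improved this round
    return -(log_probs.sum(dim=1) * reward).mean()

logits = torch.randn(4, 12, 1000, requires_grad=True)  # stand-in for the agents' token logits
log_probs = torch.log_softmax(logits, dim=-1).max(dim=-1).values
reward = torch.randn(4)                                # stand-in for the game reward
loss = reinforce_loss(log_probs, reward)
loss.backward()                                        # gradients flow back into the agents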
Reasoning Over History: Context Aware Visual Dialog
TLDR
The MAC network architecture is extended with Context-aware Attention and Memory (CAM), which attends over control states in past dialog turns to determine the necessary reasoning operations for the current question.
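A loose sketch of the idea named in this TLDR, with shapes and names of my own: a MAC-style control step that, when computing the reasoning operation for the current question, also attends over control states saved from earlier dialog turns.

import torch
import torch.nn.functional as F

def context_aware_control(question, past_controls):
    # question:      (B, D)    current question encoding
    # past_controls: (B, K, D) control states from previous dialog turns
    scores = torch.einsum("bd,bkd->bk", question, past_controls)
    weights = F.softmax(scores, dim=-1)
    memory = torch.einsum("bk,bkd->bd", weights, past_controls)
    return question + memory  # context-aware control state for this turn

q = torch.randn(2, 128)
ctrl = context_aware_control(q, torch.randn(2, 6, 128))
print(ctrl.shape)  # torch.Size([2, 128])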

References

Showing 1-10 of 100 references
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
TLDR
This work poses a cooperative ‘image guessing’ game between two agents who communicate in natural language dialog so that Q-BOT can select an unseen image from a lineup of images and shows the emergence of grounded language and communication among ‘visual’ dialog agents with no human supervision.
FLIPDIAL: A Generative Model for Two-Way Visual Dialogue
TLDR
This work presents FLIPDIAL, a generative model for Visual Dialogue that simultaneously plays the role of both participants in a visually-grounded dialogue, and is the first to extend this paradigm to full two-way visual dialogue (2VD), where the model is capable of generating both questions and answers in sequence based on a visual input.
Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images
We address a question answering task on real-world images that is set up as a Visual Turing Test. By combining latest advances in image representation and natural language processing, we propose…
Visual7W: Grounded Question Answering in Images
TLDR
This work establishes a semantic link between textual descriptions and image regions via object-level grounding, enabling a new type of QA with visual answers in addition to the textual answers used in previous work, and proposes a novel LSTM model with spatial attention to tackle the 7W QA tasks.
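A compact sketch in the spirit of this TLDR, with all sizes and names assumed: an LSTM question encoding scores each cell of a convolutional feature map, and the attention-weighted sum becomes the visual evidence used to answer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionQA(nn.Module):
    def __init__(self, vocab=1000, dim=256, channels=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.proj = nn.Linear(channels, dim)

    def forward(self, tokens, feat_map):
        # tokens: (B, L) question word ids; feat_map: (B, C, H, W) conv features
        _, (h, _) = self.lstm(self.embed(tokens))
        q = h[-1]                                               # (B, dim) question encoding
        B, C, H, W = feat_map.shape
        cells = self.proj(feat_map.flatten(2).transpose(1, 2))  # (B, H*W, dim) grid cells
        weights = F.softmax(torch.einsum("bd,bnd->bn", q, cells), dim=-1)
        attended = torch.einsum("bn,bnd->bd", weights, cells)
        return attended, weights.view(B, H, W)                  # evidence + attention map

model = SpatialAttentionQA()
evidence, amap = model(torch.randint(0, 1000, (2, 8)), torch.randn(2, 512, 7, 7))
print(evidence.shape, amap.shape)  # torch.Size([2, 256]) torch.Size([2, 7, 7])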
End-to-end optimization of goal-driven and visually grounded dialogue systems
TLDR
This paper introduces a Deep Reinforcement Learning method to optimize visually grounded task-oriented dialogues, based on the policy gradient algorithm, which provides encouraging results at solving both the problem of generating natural dialogues and the task of discovering a specific object in a complex picture.
Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences
TLDR
This work introduces a multi-level aligner that empowers the alignment-based encoder-decoder model with long short-term memory recurrent neural networks (LSTM-RNN) to translate natural language instructions to action sequences based upon a representation of the observable world state.
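A simplified single decode step, under assumed shapes and naming, of the alignment-based encoder-decoder pattern this TLDR mentions: at each action step the decoder attends over the encoded instruction words, mixes in a world-state feature, and emits action logits.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedDecoderStep(nn.Module):
    def __init__(self, dim=128, n_actions=4):
        super().__init__()
        self.cell = nn.LSTMCell(2 * dim, dim)
        self.out = nn.Linear(dim, n_actions)

    def forward(self, state, enc_words, world_feat):
        # state: (h, c) each (B, D); enc_words: (B, L, D); world_feat: (B, D)
        h, c = state
        weights = F.softmax(torch.einsum("bd,bld->bl", h, enc_words), dim=-1)
        aligned = torch.einsum("bl,bld->bd", weights, enc_words)  # attended instruction words
        h, c = self.cell(torch.cat([aligned, world_feat], dim=-1), (h, c))
        return self.out(h), (h, c)  # action logits + next decoder state

step = AlignedDecoderStep()
h = c = torch.zeros(2, 128)
logits, (h, c) = step((h, c), torch.randn(2, 9, 128), torch.randn(2, 128))
print(logits.shape)  # torch.Size([2, 4])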
Yin and Yang: Balancing and Answering Binary Visual Questions
TLDR
This paper addresses binary Visual Question Answering on abstract scenes as visual verification of concepts inquired in the questions by converting the question to a tuple that concisely summarizes the visual concept to be detected in the image.
Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions
TLDR
These approaches, based on LSTM-RNNs, VQA model uncertainty, and caption-question similarity, are able to outperform strong baselines on both relevance tasks and are shown to be more intelligent, reasonable, and human-like than previous approaches.
VQA: Visual Question Answering
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language…
Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model
TLDR
A novel training framework for neural sequence models, particularly for grounded dialog generation, that leverages the recently proposed Gumbel-Softmax approximation to the discrete distribution, and introduces a stronger encoder for visual dialog, and employs a self-attention mechanism for answer encoding.
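A minimal, generic Gumbel-Softmax sampler, illustrating only the approximation this TLDR names rather than the paper's full knowledge-transfer framework: it lets gradients pass through an (approximately) discrete word choice during training. PyTorch also ships this as torch.nn.functional.gumbel_softmax; it is written out here to show the mechanism.

import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    # add Gumbel noise, then take a temperature-controlled softmax
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)  # differentiable, near one-hot for small tau

logits = torch.randn(2, 1000, requires_grad=True)  # stand-in for a generator's word logits
soft_word = gumbel_softmax_sample(logits, tau=0.5)
soft_word.sum().backward()                         # gradients reach the generator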