Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents

@inproceedings{Zhu2020DescribingUV,
  title={Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents},
  author={Ye Zhu and Yu Wu and Yi Yang and Yan Yan},
  booktitle={ECCV},
  year={2020}
}
With rising concerns over AI systems being provided with direct access to abundant sensitive information, researchers seek to develop more reliable AI that draws on implicit information sources. To this end, in this paper, we introduce a new task called video description via two multi-modal cooperative dialog agents, whose ultimate goal is for one conversational agent to describe an unseen video based on the dialog and two static frames. Specifically, one of the intelligent agents - Q-BOT - is given… 
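The task setup above maps to a simple interaction loop. Below is a minimal illustrative sketch, assuming hypothetical q_bot and a_bot interfaces rather than the authors' implementation: Q-BOT sees only two static frames, A-BOT sees the full video, and the two exchange question-answer rounds before Q-BOT produces its description of the unseen video.

def describe_unseen_video(q_bot, a_bot, video, num_rounds=10):
    """Illustrative two-agent loop (hypothetical interfaces, not the authors' code)."""
    # Two static frames are assumed here to be the first and last frames;
    # they are the only visual input Q-BOT ever receives.
    first_frame, last_frame = video[0], video[-1]
    dialog_history = []

    for _ in range(num_rounds):
        # Q-BOT asks about what it cannot see; A-BOT answers from the full video.
        question = q_bot.ask(first_frame, last_frame, dialog_history)
        answer = a_bot.answer(video, question, dialog_history)
        dialog_history.append((question, answer))

    # Q-BOT describes the video it has never seen, using the frames and dialog only.
    return q_bot.describe(first_frame, last_frame, dialog_history)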

Saying the Unseen: Video Descriptions via Dialog Agents

This work introduces a novel task that aims to describe a video using the natural language dialog between two agents as a supplementary information source, given incomplete visual data, and experimentally demonstrates both the knowledge transfer between the two dialog agents and the effectiveness of natural language dialog as a supplement for the incomplete visual information.

Supplementing Missing Visions via Dialog for Scene Graph Generations

A model-agnostic Supplementary Interactive Dialog framework that can be jointly learned with most existing models, endowing current AI systems with the ability to conduct question-answer interactions in natural language and achieving promising performance improvements over multiple baselines.

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

The results demonstrate that AVQA benefits from multisensory perception, and the model outperforms recent A-, V-, and AVQA approaches.

Discrete Contrastive Diffusion for Cross-Modal and Conditional Generation

This work introduces a Conditional Discrete Contrastive Diffusion (CDCD) loss and designs two contrastive diffusion mechanisms to effectively incorporate it into the denoising process, formulating CDCD by connecting it with the conventional variational objectives.

Vision+X: A Survey on Multimodal Learning in the Light of Data

This paper analyzes the commonness and uniqueness of each data format, ranging from vision and audio to text and others, and presents the technical developments categorized by the combination Vision+X, where vision data play a fundamental role in most multimodal learning works.

A Metamodel and Framework for Artificial General Intelligence From Theory to Practice

One surprising consequence of the metamodel is that it enables a new level of autonomous learning and optimal functioning for machine intelligences, and it may also shed light on a path to better understanding how to improve human cognition.

Win The Lottery Ticket Via Fourier Analysis: Frequencies Guided Network Pruning

This paper investigates the Magnitude-Based Pruning (MBP) scheme, analyzes it from a novel perspective through Fourier analysis of the deep learning model to guide model design, and proposes a novel two-stage pruning approach.
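For reference, here is a minimal sketch of the generic magnitude-based pruning baseline that the paper analyzes; the frequency-guided two-stage approach itself is not reproduced here, and the helper name is hypothetical.

import torch

def magnitude_prune_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Generic Magnitude-Based Pruning: return a binary mask that zeroes out
    the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()  # apply as: weight * mask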

References

Showing 1-10 of 54 references

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

This work poses a cooperative ‘image guessing’ game between two agents who communicate in natural language dialog so that Q-BOT can select an unseen image from a lineup of images and shows the emergence of grounded language and communication among ‘visual’ dialog agents with no human supervision.
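As a rough illustration of how such cooperative agents can be trained, here is a heavily simplified sketch of a distance-based reward of the kind used in this line of work (hypothetical tensors and names, not the authors' code): Q-BOT regresses an embedding of the secret image after each dialog round, and the reduction in distance to the true image embedding rewards both agents.

import torch

def guessing_reward(predicted_embedding: torch.Tensor,
                    target_embedding: torch.Tensor,
                    prev_distance: float) -> tuple[float, float]:
    """Reward both agents when Q-BOT's image guess moves closer to the target."""
    distance = torch.norm(predicted_embedding - target_embedding).item()
    reward = prev_distance - distance  # positive when this dialog round helped
    return reward, distance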

Visual Dialog

A retrieval-based evaluation protocol for Visual Dialog in which the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as the mean reciprocal rank of the human response, together with a family of neural encoder-decoder models, which outperform a number of sophisticated baselines.
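For concreteness, a minimal sketch of the mean-reciprocal-rank style metric mentioned above (an illustrative helper, not the benchmark's official evaluation code):

def mean_reciprocal_rank(ranked_candidates, human_answers):
    """Average of 1 / rank of the human ground-truth answer in each sorted candidate list."""
    reciprocal_ranks = []
    for candidates, answer in zip(ranked_candidates, human_answers):
        rank = candidates.index(answer) + 1  # 1-indexed position of the human response
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)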

Audio Visual Scene-Aware Dialog

The task of scene-aware dialog is introduced and results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.

End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features

  • Chiori Hori, Huda AlAmri, Devi Parikh
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
This paper introduces a new data set of dialogs about videos of human behaviors, as well as an end-to-end Audio Visual Scene-Aware Dialog (AVSD) model, trained using this new data set, that generates responses in a dialog about a video.

Two Can Play This Game: Visual Dialog with Discriminative Question Generation and Answering

A simple symmetric discriminative baseline is demonstrated that can be applied to predicting both an answer and a question, and it is shown that this method performs on par with the state of the art, including memory network based methods.

Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog

This paper presents a new model for visual dialog, Recurrent Dual Attention Network (ReDAN), using multi-step reasoning to answer a series of questions about an image, and demonstrates that ReDAN can locate context-relevant visual and textual clues via iterative refinement, which can lead to the correct answer step-by-step.

Are You Talking to Me? Reasoned Visual Dialog Generation Through Adversarial Learning

A novel approach that combines Reinforcement Learning and Generative Adversarial Networks (GANs) to generate more human-like responses to questions, overcoming the relative paucity of training data and the tendency of the typical MLE-based approach to generate overly terse answers.

Recursive Visual Attention in Visual Dialog

Proposed to resolve visual co-reference in visual dialog, the RvA model not only outperforms the state-of-the-art methods, but also achieves reasonable recursion and interpretable attention maps without additional annotations.

FLIPDIAL: A Generative Model for Two-Way Visual Dialogue

This work presents FLIPDIAL, a generative model for Visual Dialogue that simultaneously plays the role of both participants in a visually-grounded dialogue, and is the first to extend this paradigm to full two-way visual dialogue (2VD), where the model is capable of generating both questions and answers in sequence based on a visual input.

Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog

Answerer in Questioner's Mind (AQM) is proposed, a novel information theoretic algorithm for goal-oriented dialog in which the questioner asks questions and infers the answer based on an approximated probabilistic model of the answerer.
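A compact sketch of the information-theoretic idea behind AQM as summarized above (hypothetical interfaces, illustrative only): the questioner scores a candidate question by the mutual information between the answer and the hidden target under its approximated model of the answerer.

import math

def information_gain(question, targets, prior, answerer_model, answers):
    """I(target; answer | question) under an approximated answerer model
    p(answer | question, target) and the questioner's prior belief over targets."""
    # Marginal answer distribution under the current belief over targets.
    p_answer = {a: sum(prior[t] * answerer_model(a, question, t) for t in targets)
                for a in answers}
    gain = 0.0
    for t in targets:
        for a in answers:
            joint = prior[t] * answerer_model(a, question, t)
            if joint > 0:
                gain += joint * math.log(joint / (prior[t] * p_answer[a]))
    return gain  # the questioner would pick the question maximizing this value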
...