Corpus ID: 208291465

Efficient Attention Mechanism for Handling All the Interactions between Many Inputs with Application to Visual Dialog

@article{Nguyen2019EfficientAM,
  title={Efficient Attention Mechanism for Handling All the Interactions between Many Inputs with Application to Visual Dialog},
  author={Van-Quang Nguyen and M. Suganuma and Takayuki Okatani},
  journal={ArXiv},
  year={2019},
  volume={abs/1911.11390}
}
It has been a primary concern in recent studies of vision and language tasks to design an effective attention mechanism dealing with interactions between the two modalities. The Transformer has recently been extended and applied to several bi-modal tasks, yielding promising results. For visual dialog, it becomes necessary to consider interactions between three or more inputs, i.e., an image, a question, and a dialog history, or even its individual dialog components. In this paper, we present a…
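To make the three-input setting concrete, below is a minimal NumPy sketch of naive all-pairs cross-attention among several utilities (e.g., image regions, question tokens, and dialog-history tokens). The function names, the averaging-based aggregation, and the residual update are illustrative assumptions, not the architecture proposed in the paper, which is designed precisely to handle these interactions more efficiently.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, source):
    # Scaled dot-product attention of one utility over another:
    # each query row receives a convex combination of source rows.
    d = query.shape[-1]
    weights = softmax(query @ source.T / np.sqrt(d), axis=-1)  # (Lq, Ls)
    return weights @ source                                    # (Lq, d)

def all_pairs_update(utilities):
    # Let every utility attend over every other utility, then fold the
    # attended contexts back in with a residual update. Averaging the
    # contexts is an arbitrary illustrative choice; real models learn
    # how to aggregate them.
    out = []
    for i, u in enumerate(utilities):
        ctx = [attend(u, v) for j, v in enumerate(utilities) if j != i]
        out.append(u + np.mean(ctx, axis=0))
    return out

# Toy example: three utilities of different lengths with a shared width.
rng = np.random.default_rng(0)
image = rng.standard_normal((36, 64))     # image-region features
question = rng.standard_normal((12, 64))  # question-token features
history = rng.standard_normal((40, 64))   # dialog-history token features
image, question, history = all_pairs_update([image, question, history])
```

With U utilities this naive scheme computes U(U-1) directional attention maps per layer, which is the quadratic blow-up that motivates a more efficient mechanism.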
DialGraph: Sparse Graph Learning Networks for Visual Dialog
TLDR: This paper formulates visual dialog as a graph structure learning task and proposes Sparse Graph Learning Networks (SGLNs), consisting of a multimodal node embedding module and a sparse graph learning module, which outperform state-of-the-art approaches on the VisDial v1.0 dataset.
Multimodal Fusion of Visual Dialog: A Survey
TLDR: A comprehensive survey of recent achievements in the Visual Dialog task, covering many aspects of multimodal fusion research: visual co-reference resolution, attention mechanisms, graph neural networks, and evaluation issues, specifically benchmark datasets, evaluation metrics, and state-of-the-art performance.
Co-Attentional Transformers for Story-Based Video Understanding
TLDR: A novel co-attentional transformer model is proposed to better capture long-term dependencies seen in visual stories such as dramas; it is evaluated on the recently introduced DramaQA dataset, which features character-centered video story understanding questions.
VD-BERT: A Unified Vision and Dialog Transformer with BERT
TLDR: This work proposes VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages the pretrained BERT language model for Visual Dialog tasks and adapts BERT for the effective fusion of vision and dialog contents via visually grounded training.
Ensemble of MRR and NDCG models for Visual Dialog
TLDR: A two-step non-parametric ranking approach that can merge strong MRR and NDCG models; this ensemble won the recent Visual Dialog 2020 challenge.
Multi-View Attention Networks for Visual Dialog
TLDR: This paper proposes the Multi-View Attention Network (MVAN), which considers complementary views of multimodal inputs based on attention mechanisms, effectively captures question-relevant information from the dialog history with two different textual views, and integrates multimodal representations with a two-step fusion process.

References

SHOWING 1-10 OF 55 REFERENCES
Parallel Attention: A Unified Framework for Visual Object Discovery Through Dialogs and Queries
TLDR: A unified framework, the ParalleL AttentioN (PLAN) network, is proposed to discover the object in an image that is being referred to in variable-length natural expression descriptions, from short phrase queries to long multi-round dialogs.
Visual Reference Resolution using Attention Memory for Visual Dialog
TLDR: This work proposes a novel attention mechanism that exploits visual attentions in the past to resolve the current reference in the visual dialog scenario, and achieves superior performance on the Visual Dialog dataset despite having significantly fewer parameters than the baselines.
Dual Attention Networks for Visual Reference Resolution in Visual Dialog
TLDR: This paper proposes Dual Attention Networks (DAN) for visual reference resolution, a model consisting of two kinds of attention networks, REFER and FIND, which outperforms the previous state-of-the-art model by a significant margin.
Visual Dialog
TLDR: A retrieval-based evaluation protocol for Visual Dialog, in which the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as mean reciprocal rank of the human response, together with a family of neural encoder-decoder models that outperform a number of sophisticated baselines.
Modality-Balanced Models for Visual Dialogue
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue. However, via manual analysis, we find that a large…
Improved Fusion of Visual and Language Representations by Dense Symmetric Co-attention for Visual Question Answering
TLDR: This work presents a simple architecture that is fully symmetric between visual and language representations, in which each question word attends on image regions and each image region attends on question words, and shows through experiments that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size.
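As a rough illustration of the symmetry described above, the following sketch builds a single affinity matrix between question words and image regions and normalizes it along each axis, so every word attends over regions and every region attends over words. This is a simplified, unlearned variant; the actual model uses learned projections and densely stacked co-attention layers.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def symmetric_coattention(words, regions):
    # One affinity matrix serves both attention directions.
    d = words.shape[-1]
    affinity = words @ regions.T / np.sqrt(d)   # (Lw, Lr)
    w2r = softmax(affinity, axis=1) @ regions   # each word attends over regions
    r2w = softmax(affinity, axis=0).T @ words   # each region attends over words
    return w2r, r2w

rng = np.random.default_rng(1)
words = rng.standard_normal((12, 64))    # question-word features
regions = rng.standard_normal((36, 64))  # image-region features
attended_words, attended_regions = symmetric_coattention(words, regions)
```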
Recursive Visual Attention in Visual Dialog
TLDR: To resolve visual co-reference in visual dialog, the proposed Recursive Visual Attention (RvA) not only outperforms the state-of-the-art methods, but also achieves reasonable recursion and interpretable attention maps without additional annotations.
Factor Graph Attention
TLDR: This work designs a factor-graph-based attention mechanism for visual dialog that operates on any number of data utilities, and illustrates its applicability on the challenging, recently introduced VisDial datasets, outperforming recent state-of-the-art methods.
Are You Talking to Me? Reasoned Visual Dialog Generation Through Adversarial Learning
TLDR: A novel approach that combines reinforcement learning and generative adversarial networks (GANs) to generate more human-like responses to questions, overcoming the relative paucity of training data and the tendency of the typical MLE-based approach to generate overly terse answers.
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, applying successfully to English constituency parsing with both large and limited training data.
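The core operation of the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy rendering (single head, no masking or learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(2)
Q = rng.standard_normal((5, 64))   # 5 queries
K = rng.standard_normal((7, 64))   # 7 keys
V = rng.standard_normal((7, 64))   # 7 values
out = scaled_dot_product_attention(Q, K, V)  # shape (5, 64)
```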