Corpus ID: 235446479

C3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues

@article{Le2021C3CC,
  title={C3: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues},
  author={Hung Le and Nancy F. Chen and Steven C. H. Hoi},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.08914}
}
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, the results are partly accomplished by exploiting biases in the datasets rather than developing multimodal reasoning, resulting in limited generalization. In this paper, we… 
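The abstract is cut off before the method details, so the following is only a minimal, hypothetical sketch of an InfoNCE-style contrastive objective over counterfactual samples, assuming PyTorch; the function name, tensor shapes, and the way counterfactual inputs are constructed are illustrative assumptions rather than the paper's actual formulation.

# Generic counterfactual contrastive loss sketch (not the paper's implementation).
import torch
import torch.nn.functional as F

def counterfactual_contrastive_loss(anchor, positive, counterfactuals, temperature=0.1):
    """Pull the factual (positive) representation toward the anchor and push
    counterfactually perturbed representations away.

    anchor:          (B, D) e.g. dialogue-context embeddings
    positive:        (B, D) embeddings of the matching (factual) video/dialogue input
    counterfactuals: (B, K, D) embeddings of K counterfactual perturbations per anchor
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    counterfactuals = F.normalize(counterfactuals, dim=-1)

    pos_sim = (anchor * positive).sum(-1, keepdim=True)            # (B, 1)
    neg_sim = torch.einsum('bd,bkd->bk', anchor, counterfactuals)  # (B, K)

    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature    # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)         # positive sits at index 0
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings (batch of 4, 2 counterfactuals each, dim 16):
if __name__ == "__main__":
    B, K, D = 4, 2, 16
    loss = counterfactual_contrastive_loss(torch.randn(B, D), torch.randn(B, D), torch.randn(B, K, D))
    print(loss.item())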

End-to-End Multimodal Representation Learning for Video Dialog

This study proposes a new framework that combines a 3D CNN and transformer-based networks into a single visual encoder to extract more robust semantic representations from videos.

COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality

This work proposes COMPOSER, a Multiscale Transformer-based architecture that performs attention-based reasoning over tokens at each scale, learns group activity compositionally, and demonstrates the model's strength and interpretability on two widely used datasets.

COMPOSER: Compositional Learning of Group Activity in Videos

This work proposes COMPOSER, a Multiscale Transformer-based architecture that performs attention-based reasoning over tokens at each scale, learns group activity compositionally, and achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality.

References

Showing 1-10 of 78 references

DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue

This paper presents DVD, a Diagnostic Dataset for Video-grounded Dialogue, designed to contain minimal biases and annotated in detail for the different types of reasoning over the spatio-temporal space of a video.

CoCo: Controllable Counterfactuals for Evaluating Dialogue State Trackers

Human evaluations show that CoCo-generated conversations reflect the underlying user goal with more than 95% accuracy and are as human-like as the original conversations, further strengthening CoCo's reliability and its promise to be adopted as part of the robustness evaluation of DST models.

Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems

Proposes Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities, along with a training procedure that simulates token-level decoding to improve the quality of generated responses during inference.

Bridging Text and Video: A Universal Multimodal Transformer for Audio-Visual Scene-Aware Dialog

This work proposes a universal multimodal transformer and introduces a multi-task learning method to learn joint representations across different modalities and to generate informative and fluent responses by leveraging a pre-trained language model.

From FiLM to Video: Multi-turn Question Answering with Multi-modal Context

A hierarchical encoder-decoder model is proposed which computes a multi-modal embedding of the dialogue context and achieves relative improvements of more than 16% on BLEU-4 (0.36) and more than 33% on CIDEr (0.997).

BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

This work proposes Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues that achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.

Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding

This paper designs three counterfactual transformation strategies at the feature, interaction, and relation levels: the feature-level method damages the visual features of selected proposals, the interaction-level approach confuses the vision-language interaction, and the relation-level strategy destroys the context clues in proposal relationships.

Visual Dialog

A retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank of human response, and a family of neural encoder-decoder models, which outperform a number of sophisticated baselines.

Multimodal Explanations by Predicting Counterfactuality in Videos

The effectiveness of the proposed explanation model, trained to predict counterfactuality for possible combinations of multimodal information in a post-hoc manner, is demonstrated by comparison with a baseline on action recognition datasets extended for this task.

Learning to Contrast the Counterfactual Samples for Robust Visual Question Answering

This work introduces a novel self-supervised contrastive learning mechanism to learn the relationship between original, factual, and counterfactual samples, and demonstrates its effectiveness by surpassing current state-of-the-art models on the VQA-CP dataset.
...