The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training

@article{Kang2022TheDM,
  title={The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training},
  author={Gi-Cheon Kang and Sungdong Kim and Jin-Hwa Kim and Donghyun Kwak and Byoung-Tak Zhang},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.12502}
}
Visual dialog (VisDial) is a task of answering a sequence of questions grounded in an image, using the dialog history as context. Prior work has trained the dialog agents solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for visually-grounded dialog, called Generative Self-Training (GST), to leverage unlabeled images on the Web. Specifically, GST first retrieves in-domain images… 
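
Read at face value, the abstract describes a teacher-student self-training loop over unlabeled web images. The sketch below is one plausible reading of that loop; every name in it (train_agent, retrieve_in_domain, generate_dialog) is a hypothetical placeholder, not the authors' actual API.

from typing import Callable, List, Sequence

def generative_self_training(
    labeled_dialogs: List[dict],
    web_images: Sequence[object],
    train_agent: Callable[[List[dict]], object],        # returns a trained dialog agent
    retrieve_in_domain: Callable[[Sequence[object]], List[object]],
    num_rounds: int = 3,
) -> object:
    """Hypothetical GST loop: grow the training set with machine-generated dialogs."""
    teacher = train_agent(labeled_dialogs)              # 1. supervised teacher on VisDial
    for _ in range(num_rounds):
        images = retrieve_in_domain(web_images)         # 2. unlabeled in-domain images
        # 3. pseudo-labels: the teacher generates a synthetic dialog per image
        #    (generate_dialog is an assumed method on the agent object)
        synthetic = [teacher.generate_dialog(img) for img in images]
        # 4. retrain on labeled + synthetic dialogs; promote student to teacher
        teacher = train_agent(labeled_dialogs + synthetic)
    return teacher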

References

Showing 1-10 of 90 references
Leveraging Pre-trained Checkpoints for Sequence Generation Tasks
S. Rothe, Shashi Narayan, Aliaksei Severyn · Transactions of the Association for Computational Linguistics · 2020
TLDR: A Transformer-based sequence-to-sequence model that is compatible with publicly available pre-trained BERT, GPT-2, and RoBERTa checkpoints is developed, and an extensive empirical study on the utility of initializing the model, both encoder and decoder, with these checkpoints is conducted.
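
The warm-starting idea in this reference can be tried today through Hugging Face Transformers' EncoderDecoderModel (note: that library API postdates the paper and is offered here only as an illustration, not as the authors' implementation).

from transformers import EncoderDecoderModel

# Warm-start a sequence-to-sequence model from public BERT checkpoints.
# The encoder loads BERT weights directly; the decoder reuses them and
# adds randomly initialized cross-attention layers.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased",  # encoder checkpoint
    "bert-base-uncased",  # decoder checkpoint
)
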
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
TLDR: The results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.
Rethinking Pre-training and Self-training
TLDR: Self-training works well on exactly the setup where pre-training does not (using ImageNet to help COCO); on the PASCAL segmentation dataset, although pre-training does help significantly, self-training still improves upon the pre-trained model.
UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog
TLDR: A contrastive learning-based framework, UTC, is proposed to unify and facilitate both discriminative and generative tasks in visual dialog with a single model, devising two inter-task contrastive losses to make the discriminative and generative tasks mutually reinforce each other.
Self-training Improves Pre-training for Natural Language Understanding
TLDR: SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web, is introduced.
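
The retrieval step summarized above can be illustrated generically with embedding similarity. The sketch below assumes precomputed sentence embeddings as NumPy arrays; it shows the idea, not SentAugment's actual pipeline.

import numpy as np

def retrieve_similar(task_embs: np.ndarray, bank_embs: np.ndarray, k: int = 100) -> np.ndarray:
    """Return indices of the k bank sentences nearest the task-specific query."""
    query = task_embs.mean(axis=0)                      # task-specific query embedding
    query = query / np.linalg.norm(query)
    bank = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    scores = bank @ query                               # cosine similarity to each bank sentence
    return np.argsort(-scores)[:k]                      # top-k most similar sentences
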
VD-BERT: A Unified Vision and Dialog Transformer with BERT
TLDR: This work proposes VD-BERT, a simple yet effective framework of unified vision-dialog Transformer that leverages pre-trained BERT language models for Visual Dialog tasks, adapting BERT for the effective fusion of vision and dialog contents via visually grounded training.
FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence
TLDR: This paper demonstrates the power of a simple combination of two common SSL methods, consistency regularization and pseudo-labeling, and shows that FixMatch achieves state-of-the-art performance across a variety of standard semi-supervised learning benchmarks.
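
The TLDR above compresses an algorithm that is short enough to state in code. The following PyTorch sketch of FixMatch's unlabeled loss term assumes a classifier model and weak/strong augmentation callables; the 0.95 confidence threshold matches the paper's commonly cited default, while the surrounding names are illustrative.

import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, unlabeled_batch, weak_aug, strong_aug,
                            threshold: float = 0.95) -> torch.Tensor:
    # Pseudo-label from the weakly augmented view (no gradients flow here).
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(unlabeled_batch)), dim=-1)
        confidence, pseudo_labels = probs.max(dim=-1)
        mask = (confidence >= threshold).float()        # keep only confident predictions

    # Consistency: the strongly augmented view must match the pseudo-label.
    logits_strong = model(strong_aug(unlabeled_batch))
    per_example = F.cross_entropy(logits_strong, pseudo_labels, reduction="none")
    return (per_example * mask).mean()
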
Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
TLDR: This work adapts the recently proposed ViLBERT model for multi-turn visually-grounded conversations and finds that additional fine-tuning using "dense" annotations in VisDial leads to even higher NDCG but hurts MRR, highlighting a trade-off between the two primary metrics.
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.
Learning to Generate Visual Questions with Noisy Supervision
TLDR: A novel learning approach for double-hint-based VQG, which can be cast as a weakly supervised learning problem with noise, and which outperforms the state-of-the-art approaches by a large margin on a variety of metrics, including both automatic machine metrics and human evaluation.
...