VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

@article{Lin2021VX2TEXTEL,
  title={VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs},
  author={Xudong Lin and Gedas Bertasius and Jue Wang and Shih-Fu Chang and Devi Parikh and Lorenzo Torresani},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021},
  pages={7001-7011}
}
We present VX2TEXT, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. In order to leverage transformer networks, which have been shown to be effective at modeling language, each modality is first converted into a set of language embeddings by a learnable tokenizer. This allows our approach to perform multimodal fusion in the language space, thus eliminating the need for ad-hoc cross-modal fusion modules. To address the non-differentiability… 
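The abstract describes mapping each modality into language embeddings with a learnable tokenizer so that fusion happens directly in the language space of a generative text model. The sketch below is only a rough illustration of that idea, not the authors' implementation: it assumes hypothetical per-modality category scores, maps them onto category-name embeddings with a generic Gumbel-softmax relaxation (a common way to keep a discrete selection differentiable), and concatenates the result with text embeddings. All names, dimensions, and the choice of relaxation are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 768        # assumed width of the language embedding space
NUM_CATEGORIES = 400   # assumed size of a modality's category vocabulary (e.g. action classes)

class SoftModalityTokenizer(nn.Module):
    """Hypothetical tokenizer: turns modality classifier scores into language-space embeddings."""

    def __init__(self, category_token_embeddings: torch.Tensor):
        super().__init__()
        # Embeddings of the text tokens naming each category, shape (C, D).
        self.register_buffer("cat_emb", category_token_embeddings)

    def forward(self, scores: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # scores: (B, C) unnormalized category scores from a modality-specific backbone.
        # Gumbel-softmax keeps the (soft) category selection differentiable end to end.
        probs = F.gumbel_softmax(scores, tau=tau, hard=False)   # (B, C)
        return probs @ self.cat_emb                             # (B, D) embedding in language space

# Usage: fuse placeholder video and audio "tokens" with embedded text by simple concatenation.
cat_emb = torch.randn(NUM_CATEGORIES, EMBED_DIM)    # placeholder category-name embeddings
video_tokenizer = SoftModalityTokenizer(cat_emb)
audio_tokenizer = SoftModalityTokenizer(cat_emb)

video_scores = torch.randn(2, NUM_CATEGORIES)       # placeholder backbone outputs, batch of 2
audio_scores = torch.randn(2, NUM_CATEGORIES)
text_emb = torch.randn(2, 16, EMBED_DIM)            # placeholder embedded question/dialog text

fused = torch.cat(
    [text_emb,
     video_tokenizer(video_scores).unsqueeze(1),
     audio_tokenizer(audio_scores).unsqueeze(1)],
    dim=1,
)                                                   # (2, 18, EMBED_DIM): fusion in the language space

In this sketch the fused sequence would then be consumed by a generative encoder-decoder text model; the point being illustrated is that once every input lives in the same embedding space, no modality-specific cross-attention or fusion module is needed.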
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
TLDR
The goal of this work is to build flexible video-language models that can generalize from a few examples to various video-to-text tasks, such as domain-specific captioning, question answering, and future event prediction; the proposed approach outperforms state-of-the-art supervised models trained on any video dataset.
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval
TLDR
This paper identifies a principled model design space with two axes: how to represent videos and how to fuse video and text information, and surprisingly finds that coupling discrete text tokens with a pretrained contrastive text model yields the best performance.
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
TLDR
This work builds on frozen bidirectional language models (BiLM) and shows that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA, while also demonstrating competitive performance in the few-shot and fully-supervised settings.
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
TLDR
A Cross-modal Information Flow Alignment mechanism captures and aligns the visual and textual information flows, endowing the captioning process with richer context and dynamics of event/topic evolution.
Generative Adversarial Network for Text-to-Face Synthesis and Manipulation
TLDR
This work proposes an approach for facial image generation and manipulation from text descriptions, and introduces the first Text-to-Face synthesis dataset with large-scale facial attributes.
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
TLDR
MERLOT Reserve is introduced, a model that jointly represents videos over time through a new training objective that learns from audio, subtitles, and video frames, enabling out-of-the-box prediction and revealing strong multimodal commonsense understanding.
Learning to Retrieve Videos by Asking Questions
TLDR
This work proposes ViReD, a novel framework for Video Retrieval using Dialog that enables the user to interact with an AI agent via multiple rounds of dialog, together with Information-Guided Supervision (IGS), which guides the question generator to ask questions that boost subsequent video retrieval accuracy.
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
TLDR
Video-And-Language Understanding Evaluation (VALUE) benchmark is introduced, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning, which promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks.
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
TLDR
This work proposes a novel and simple recipe to pre-train a multi-task vision-language joint model and observes that the proposed approach generalizes to unseen tasks and that more diverse task mixtures lead to higher accuracy on both known and novel tasks.
Revealing Single Frame Bias for Video-and-Language Learning
TLDR
This work shows the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training.

References

Showing 1-10 of 62 references
Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems
TLDR
Proposes Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities, along with a training procedure that simulates token-level decoding to improve the quality of generated responses during inference.
UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
TLDR
Experimental results demonstrate that UniVL learns strong video-text representations and achieves state-of-the-art results on five downstream tasks.
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training
TLDR
HERO, a novel framework for large-scale video+language omni-representation learning, is presented; it achieves new state of the art on multiple benchmarks covering text-based video/video-moment retrieval, video question answering (QA), video-and-language inference, and video captioning across different domains.
VideoBERT: A Joint Model for Video and Language Representation Learning
TLDR
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively; the resulting model can be applied directly to open-vocabulary classification.
Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA
TLDR
A novel model is proposed based on a multimodal transformer architecture accompanied by a rich representation for text in images that enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification.
TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog
TLDR
Building on Multimodal Transformer Networks (MTN), TMT is applied to video and dialog to form MTN-TMT, a video-grounded dialog system that outperforms MTN and other submitted models on both the Video and Text task and the Text Only task.
End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
  • Chiori Hori, Huda AlAmri, Devi Parikh
  • Computer Science
    ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
This paper introduces a new dataset of dialogs about videos of human behaviors, as well as an end-to-end Audio Visual Scene-Aware Dialog (AVSD) model, trained using this new dataset, that generates responses in a dialog about a video.
ActBERT: Learning Global-Local Video-Text Representations
  • Linchao Zhu, Yi Yang
  • Computer Science
    2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
TLDR
This paper introduces ActBERT for self-supervised learning of joint video-text representations from unlabeled data and introduces an ENtangled Transformer block to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions.