VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs
@article{Lin2021VX2TEXTEL,
  title   = {VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs},
  author  = {Xudong Lin and Gedas Bertasius and Jue Wang and Shih-Fu Chang and Devi Parikh and Lorenzo Torresani},
  journal = {2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2021},
  pages   = {7001-7011}
}
We present VX2TEXT, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. In order to leverage transformer networks, which have been shown to be effective at modeling language, each modality is first converted into a set of language embeddings by a learnable tokenizer. This allows our approach to perform multimodal fusion in the language space, thus eliminating the need for ad-hoc cross-modal fusion modules. To address the non-differentiability…
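The architecture described in the abstract can be summarized in a short sketch. The following is purely illustrative (not the authors' released code): it assumes a Hugging Face-style generative encoder-decoder (e.g. T5) that accepts `inputs_embeds`, per-modality backbones that output fixed-size feature vectors, and small category vocabularies whose words appear in the language model's vocabulary. All module names, dimensions, and the hard top-k selection are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ModalityTokenizer(nn.Module):
    """Maps one modality's feature vector to k language-space embeddings."""

    def __init__(self, feat_dim, category_token_ids, lm_embedding, k=5):
        super().__init__()
        # category_token_ids: LongTensor of language-model token ids, one per
        # category word (hypothetical vocabulary, not the paper's categories).
        self.classifier = nn.Linear(feat_dim, category_token_ids.numel())
        self.register_buffer("category_token_ids", category_token_ids)
        self.lm_embedding = lm_embedding  # embedding table shared with the text model
        self.k = k

    def forward(self, feats):                        # feats: (B, feat_dim)
        scores = self.classifier(feats)              # (B, num_categories)
        # Hard top-k is non-differentiable; the paper addresses this, but the
        # truncated abstract does not give the mechanism, so plain top-k is
        # used here purely for illustration.
        topk = scores.topk(self.k, dim=-1).indices   # (B, k)
        token_ids = self.category_token_ids[topk]    # (B, k)
        return self.lm_embedding(token_ids)          # (B, k, d_model)


class Vx2TextSketch(nn.Module):
    """Fuses modalities by simple concatenation in the language-embedding space."""

    def __init__(self, text_model, video_dim, audio_dim,
                 video_category_ids, audio_category_ids):
        super().__init__()
        emb = text_model.get_input_embeddings()      # Hugging Face-style accessor
        self.text_model = text_model                 # generative encoder-decoder
        self.video_tok = ModalityTokenizer(video_dim, video_category_ids, emb)
        self.audio_tok = ModalityTokenizer(audio_dim, audio_category_ids, emb)

    def forward(self, text_ids, video_feats, audio_feats, labels=None):
        emb = self.text_model.get_input_embeddings()
        fused = torch.cat([emb(text_ids),            # question / dialog context
                           self.video_tok(video_feats),
                           self.audio_tok(audio_feats)], dim=1)
        # The decoder then generates the answer or caption as free-form text.
        return self.text_model(inputs_embeds=fused, labels=labels)
```

Because fusion happens on language embeddings, no modality-specific cross-attention module is needed; the text decoder treats the tokenized video and audio exactly like extra words in its input.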
18 Citations
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
- Computer Science, ArXiv
- 2022
This work builds flexible video-language models that generalize from few examples to various video-to-text tasks, such as domain-specific captioning, question answering, and future event prediction, and outperforms state-of-the-art supervised models trained on any video dataset.
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval
- Computer Science
- 2022
This paper identifies a principled model design space with two axes, how to represent videos and how to fuse video and text information, and surprisingly finds that representing videos as discrete text tokens coupled with a pretrained contrastive text model yields the best performance.
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
- Computer Science, ArXiv
- 2022
This work builds on frozen bidirectional language models (BiLM) and shows that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA and demonstrates competitive performance in the few-shot and fully-supervised setting.
DVCFlow: Modeling Information Flow Towards Human-like Video Captioning
- Computer Science, ArXiv
- 2021
By designing a Cross-modal Information Flow Alignment mechanism, the visual and textual information flows are captured and aligned, which endows the captioning process with richer context and dynamics on event/topic evolution.
Generative Adversarial Network for Text-to-Face Synthesis and Manipulation
- Computer Science, ACM Multimedia
- 2021
This work proposes an approach for facial image generation and manipulation from text descriptions, and introduces the first Text-to-Face synthesis dataset with large-scale facial attributes.
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
- Computer Science, ArXiv
- 2022
MERLOT Reserve is introduced, a model that jointly represents videos over time through a new training objective that learns from audio, subtitles, and video frames, enabling out-of-the-box prediction and revealing strong multimodal commonsense understanding.
Learning to Retrieve Videos by Asking Questions
- Computer Science, ArXiv
- 2022
This work proposes a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog, together with Information-Guided Supervision (IGS), which guides the question generator to ask questions that boost subsequent video retrieval accuracy.
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation
- Computer Science, NeurIPS Datasets and Benchmarks
- 2021
The Video-and-Language Understanding Evaluation (VALUE) benchmark is introduced: an assemblage of 11 VidL datasets over 3 popular tasks, (i) text-to-video retrieval, (ii) video question answering, and (iii) video captioning. The benchmark promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks.
Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering
- Computer Science, ArXiv
- 2022
This work proposes a novel and simple recipe to pre-train a multi-task vision-language joint model, and observes that the approach is able to generalize to unseen tasks and that more diverse task mixtures lead to higher accuracy on both known and novel tasks.
Revealing Single Frame Bias for Video-and-Language Learning
- Computer Science, ArXiv
- 2022
This work shows the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training.
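The frame-ensemble strategy mentioned in this summary is simple enough to sketch. The snippet below is my illustration, not the cited paper's code; the `single_frame_model` callable, its signature, and the use of mean-pooling are assumptions.

```python
import torch


def frame_ensemble_scores(single_frame_model, frames, text_inputs):
    """Average per-frame scores from a model trained on single frames.

    frames: tensor of shape (num_frames, C, H, W). At inference time each
    frame is scored independently against the text inputs, and the results
    are mean-pooled, which is one simple form of a frame ensemble.
    """
    per_frame = [single_frame_model(frame.unsqueeze(0), text_inputs)
                 for frame in frames]              # one forward pass per frame
    return torch.stack(per_frame, dim=0).mean(dim=0)
```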
References
Showing 1-10 of 62 references
Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems
- Computer Science, ACL
- 2019
Multimodal Transformer Networks (MTN) are proposed to encode videos and incorporate information from different modalities, along with a training procedure that simulates token-level decoding to improve the quality of generated responses during inference.
UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
- Computer Science, ArXiv
- 2020
Experimental results demonstrate that the UniVL can learn strong video-text representation and achieves state-of-the-art results on five downstream tasks.
Unified Vision-Language Pre-Training for Image Captioning and VQA
- Computer Science, AAAI
- 2020
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Computer Science, J. Mach. Learn. Res.
- 2020
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
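As a concrete illustration of this text-to-text framing, the public T5 checkpoints can be driven through Hugging Face Transformers. This usage example is mine, not from the paper, and assumes the `transformers` and `sentencepiece` packages are installed.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is expressed as text in, text out; here, translation via a task prefix.
inputs = tokenizer("translate English to German: The video shows a dog.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```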
Hero: Hierarchical Encoder for Video+Language Omni-representation Pre-training
- Computer Science, EMNLP
- 2020
HERO, a novel framework for large-scale video+language omni-representation learning, is presented; it achieves new state of the art on multiple benchmarks spanning text-based video/video-moment retrieval, video question answering (QA), video-and-language inference, and video captioning tasks across different domains.
VideoBERT: A Joint Model for Video and Language Representation Learning
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
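The visual-token construction summarized above can be sketched in a few lines: clip-level features are vector-quantized with k-means, and each clip is replaced by the id of its nearest centroid, giving a discrete vocabulary that a BERT-style model can consume. This is an illustrative sketch with assumed feature shapes and cluster count, not the VideoBERT pipeline itself.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder clip-level features; in practice these would come from a video
# backbone (the backbone and dimensionality here are assumptions).
clip_features = np.random.randn(10000, 1024).astype(np.float32)

# Build the visual "vocabulary" by clustering; each centroid id acts as one
# discrete visual token.
kmeans = KMeans(n_clusters=512, n_init=10, random_state=0).fit(clip_features)


def video_to_visual_tokens(video_clip_features):
    """Map (num_clips, feat_dim) features to a sequence of visual token ids."""
    return kmeans.predict(video_clip_features)


tokens = video_to_visual_tokens(clip_features[:16])   # e.g. 16 ids in [0, 512)
```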
Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
A novel model is proposed based on a multimodal transformer architecture accompanied by a rich representation for text in images that enables iterative answer decoding with a dynamic pointer network, allowing the model to form an answer through multi-step prediction instead of one-step classification.
TMT: A Transformer-based Modal Translator for Improving Multimodal Sequence Representations in Audio Visual Scene-aware Dialog
- Computer Science, INTERSPEECH
- 2020
Based on Multimodal Transformer Networks (MTN), TMT is applied to video and dialog, yielding MTN-TMT for video-grounded dialog systems, which outperforms MTN and other submission models on both the Video and Text task and the Text Only task.
End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
This paper introduces a new data set of dialogs about videos of human behaviors, as well as an end-to-end Audio Visual Scene-Aware Dialog (AVSD) model, trained using this new data set, that generates responses in a dialog about a video.
ActBERT: Learning Global-Local Video-Text Representations
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This paper introduces ActBERT for self-supervised learning of joint video-text representations from unlabeled data and introduces an ENtangled Transformer block to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions.