• Corpus ID: 235352775

MERLOT: Multimodal Neural Script Knowledge Models

  • Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech, in an entirely label-free, self-supervised manner. By pretraining with a mix of both frame-level (spatial) and video-level (temporal) objectives, our model not only learns to match images to… 

MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound

MERLOT Reserve is introduced, a model that represents videos jointly over time through a new training objective that learns from audio, subtitles, and video frames, enabling out-of-the-box prediction and revealing strong multimodal commonsense understanding.

Flamingo: a Visual Language Model for Few-Shot Learning

It is demonstrated that a single Flamingo model can achieve a new state of the art for few-shot learning, simply by prompting the model with task-specific examples.

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from a few examples, such as domain-specific captioning, question answering, and future event prediction; the resulting model outperforms state-of-the-art supervised models trained on any video dataset.

Modality Alignment between Deep Representations for Effective Video-and-Language Learning

A novel Modality Alignment method is proposed that benefits the cross-modality attention module by guiding it to easily amalgamate multiple modalities by exploiting Centered Kernel Alignment (CKA), which was originally proposed to measure the similarity between two deep representations.
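As a rough illustration of the CKA measure this entry builds on, here is a minimal linear-CKA sketch in NumPy. The function name and matrix shapes are illustrative assumptions, not the paper's code:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices X (n x d1) and Y (n x d2), whose rows are the same n examples."""
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)
```

The value lies in [0, 1], equals 1 for identical representations, and is invariant to isotropic scaling and orthogonal rotation, which is why it is a convenient similarity for comparing deep representations across modalities.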

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

The proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, and demonstrates competitive performance in the few-shot and fully-supervised setting.

Multimodal Knowledge Alignment with Reinforcement Learning

This work proposes ESPER, a novel approach to reinforcement learning which extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning, and demonstrates that it outperforms baselines and prior work on a variety of zero-shot tasks.

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Extensive analyses further demonstrate the advantages of LAVENDER over existing VidL methods: it supports all downstream tasks with just a single set of parameter values when multi-task fine-tuned, generalizes to various downstream tasks in the few-shot setting, and enables zero-shot evaluation on video question answering tasks.

Multimodal Event Graphs: Towards Event Centric Understanding of Multimodal World

The novel task of MultiModal Event-Event Relations (M2E2R) is proposed to recognize cross-modal event relations, and the proposed MERP (Multimodal Event Relations Predictor), trained on pseudo labels while also leveraging commonsense knowledge from an external Knowledge Base, is evaluated.

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

A novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) is proposed for many visual tasks; it achieves new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks.

Clover: Towards A Unified Video-Language Alignment and Fusion Model

Clover is introduced, a Correlated Video-Language pre-training method, towards a universal video-language model for solving multiple video understanding tasks with neither performance nor efficiency compromise; it establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks in both zero-shot and fine-tuning settings.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA named TGIF-QA that extends existing VQA work with these new tasks.

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos; it outperforms all published self-supervised approaches on these tasks as well as several fully supervised baselines.
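The idea behind MIL-NCE is to treat several temporally nearby narrations as a *bag* of candidate positives (absorbing the misalignment between speech and what is on screen) and contrast the whole bag against sampled negatives. A toy NumPy sketch, where names and shapes are assumptions and not the authors' implementation:

```python
import numpy as np

def mil_nce_loss(video_emb, pos_text_embs, neg_text_embs):
    """MIL-NCE for one clip: a bag of candidate positive narrations
    (rows of pos_text_embs) vs. sampled negatives (rows of neg_text_embs).
    Embeddings are assumed comparable via dot-product similarity."""
    pos = np.exp(pos_text_embs @ video_emb)  # scores of candidate positives
    neg = np.exp(neg_text_embs @ video_emb)  # scores of negatives
    # Probability mass assigned to the whole bag of positives.
    return -np.log(pos.sum() / (pos.sum() + neg.sum()))
```

Because the numerator sums over all candidates, the loss does not force any single narration to align with the clip, only the bag as a whole, which is what makes the objective robust to narration/frame misalignment.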

VideoBERT: A Joint Model for Video and Language Representation Learning

This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
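The "vector quantization of video data" step amounts to assigning each frame feature to its nearest centroid in a learned codebook, turning continuous visual features into discrete tokens a BERT-style model can consume. A minimal nearest-centroid sketch (illustrative assumption, not VideoBERT's code):

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector (rows of features, shape (n, d)) to the id of
    its nearest codebook centroid (rows of codebook, shape (k, d)).
    The resulting ids act as discrete 'visual words' / tokens."""
    # (n, k) matrix of squared Euclidean distances to every centroid.
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)  # one token id per frame feature
```

In practice the codebook would come from k-means (or similar) over a large pool of video features; here it is just a given array.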

ActBERT: Learning Global-Local Video-Text Representations

  • Linchao Zhu, Yi Yang
  • Computer Science
    2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2020
This paper introduces ActBERT for self-supervised learning of joint video-text representations from unlabeled data and introduces an ENtangled Transformer block to encode three sources of information, i.e., global actions, local regional objects, and linguistic descriptions.

Learning Temporal Dynamics from Cycles in Narrated Video

This work proposes a self-supervised solution to temporal cycle consistency jointly in vision and language, training on narrated video, that learns modality-agnostic functions to predict forward and backward in time, which must undo each other when composed.

What Is More Likely to Happen Next? Video-and-Language Future Event Prediction

This work collects a new dataset, named Video-and-Language Event Prediction (VLEP), with 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips, and presents a strong baseline incorporating information from video, dialogue, and commonsense knowledge.

DramaQA: Character-Centered Video Story Understanding with Hierarchical QA

A novel video question answering (Video QA) task, DramaQA, for a comprehensive understanding of the video story, and suggests Multi-level Context Matching model which hierarchically understands character-centered representations of video to answer questions.

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.

A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering

This task is not solvable by a language model alone; the model combining 2D and 3D visual information provides the best result, yet all models perform significantly worse than human level.

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.