What Is More Likely to Happen Next? Video-and-Language Future Event Prediction

Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
Given a video with aligned dialogue, people can often infer what is more likely to happen next. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of commonsense knowledge. In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions. To support research in this direction, we collect a new dataset, named Video-and-Language Event Prediction… 

MERLOT: Multimodal Neural Script Knowledge Models

This work introduces MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech – in an entirely label-free, self-supervised manner – and achieves state-of-the-art performance on 12 different video QA datasets when finetuned.

Revisiting the “Video” in Video-Language Understanding

It is found that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding.

Learning-by-Narrating: Narrative Pre-Training for Zero-Shot Dialogue Comprehension

A novel narrative-guided pre-training strategy is developed that learns by narrating the key information from a dialogue input, automatically aligning movie subtitles with their synopses. Experimental results show that the model not only achieves superior zero-shot performance but also exhibits stronger fine-grained dialogue comprehension capabilities.

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners

The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from a few examples, such as domain-specific captioning, question answering, and future event prediction; the resulting approach outperforms state-of-the-art supervised models trained on any video dataset.

Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos

This work proposes a weakly supervised algorithm for localizing both goal-directed and unintentional temporal regions in a video using solely video-level labels, and employs an attention-based strategy that predicts the temporal regions that contribute the most to a classification task.

Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling

A novel pre-training objective is proposed, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences and outperforms previous work pre-trained on orders of magnitude larger datasets.

Revealing Single Frame Bias for Video-and-Language Learning

This work shows the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training.

When can I Speak? Predicting initiation points for spoken dialogue agents

This work predicts the lead-time to initiation using prosodic features from a pre-trained speech representation model (wav2vec 1.0) operating on user audio and word features from a GPT-2 model operating on incremental transcriptions, and finds that the method outperforms features from prior work on both metrics and vastly outperforms the common approach of waiting for 700 ms of silence.

QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries

A transformer encoder-decoder model is presented that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end; it performs competitively when compared to well-engineered architectures.

Violin: A Large-Scale Dataset for Video-and-Language Inference

A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video.

DeepStory: Video Story QA by Deep Embedded Memory Networks

A video-story learning model, Deep Embedded Memory Networks (DEMN), is proposed to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data; it outperforms other QA models.

TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering

This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA, named TGIF-QA, that extends existing VQA work with these new tasks.

Anticipating Visual Representations from Unlabeled Video

This work presents a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects, applying recognition algorithms to the predicted representations to anticipate objects and actions.

Localizing Moments in Video with Natural Language

The Moment Context Network (MCN) is proposed which effectively localizes natural language queries in videos by integrating local and global video features over time and outperforms several baseline methods.

TVQA: Localized, Compositional Video Question Answering

This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.

Generating Videos with Scene Dynamics

A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed that can generate tiny videos up to a second at full frame rate better than simple baselines.

Visual Commonsense Graphs: Reasoning about the Dynamic Context of a Still Image

This work proposes VisualComet, a novel framework of visual commonsense reasoning tasks to predict events that might have happened before, events that might happen next, and the intents of the people at present; strong baseline performances are established on this task, demonstrating that integration between visual and textual commonsense reasoning is key and wins over non-integrative alternatives.

Oops! Predicting Unintentional Action in Video

A dataset of in-the-wild videos of unintentional action is introduced, along with a suite of tasks for recognizing, localizing, and anticipating its onset; a supervised neural network is trained as a baseline and its performance is compared to human consistency on these tasks.