What Is More Likely to Happen Next? Video-and-Language Future Event Prediction
@article{Lei2020WhatIM,
  title={What Is More Likely to Happen Next? Video-and-Language Future Event Prediction},
  author={Jie Lei and Licheng Yu and Tamara L. Berg and Mohit Bansal},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.07999}
}
Given a video with aligned dialogue, people can often infer what is more likely to happen next. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of commonsense knowledge. In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions. To support research in this direction, we collect a new dataset, named Video-and-Language Event Prediction…
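To make the prediction setup concrete, the sketch below shows what a single VLEP-style example could look like in code. This is only an illustrative assumption: the field names, types, and sample values are hypothetical and do not reflect the dataset's actual schema or release format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FutureEventExample:
    vid_id: str                        # identifier of the source video clip (hypothetical field)
    premise_span: Tuple[float, float]  # (start_sec, end_sec) of the premise segment
    dialogue: List[str]                # subtitle/dialogue lines aligned with the premise
    future_events: List[str]           # natural-language candidate next events
    label: int                         # index of the event judged more likely to happen

# Purely illustrative values, not taken from the dataset.
example = FutureEventExample(
    vid_id="clip_00042",
    premise_span=(12.0, 18.5),
    dialogue=["A: Careful, the floor is wet.", "B: I'm fine, I've got it."],
    future_events=["B slips on the wet floor.", "B puts the mop away."],
    label=0,
)
```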
35 Citations
MERLOT: Multimodal Neural Script Knowledge Models
- Computer Science, NeurIPS
- 2021
This work introduces MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech, in an entirely label-free, self-supervised manner, and achieves state-of-the-art performance on 12 different video QA datasets when fine-tuned.
Revisiting the “Video” in Video-Language Understanding
- Computer Science, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2022
It is found that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding.
Learning-by-Narrating: Narrative Pre-Training for Zero-Shot Dialogue Comprehension
- Computer Science, ACL
- 2022
A novel narrative-guided pre-training strategy is developed that learns by narrating the key information from a dialogue input, using automatically aligned movie subtitles and their synopses; experimental results show that the model not only achieves superior zero-shot performance but also exhibits stronger fine-grained dialogue comprehension capabilities.
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
- Computer Science, ArXiv
- 2022
The goal of this work is to build flexible video-language models that can generalize to various video-to-text tasks from few examples, such as domain-specific captioning, question answering, and future event prediction, and that outperform state-of-the-art supervised models trained on any video datasets.
Tragedy Plus Time: Capturing Unintended Human Activities from Weakly-labeled Videos
- Computer Science, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
- 2022
This work proposes a weakly supervised algorithm for localizing both the goal-directed and the unintentional temporal regions in a video using only video-level labels, and employs an attention-based strategy that predicts the temporal regions which contribute the most to a classification task.
Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling
- Computer Science, BMVC
- 2022
A novel pre-training objective, Temporal Referring Modeling, is proposed, which requires the model to identify the temporal positions of events in video sequences; the resulting model outperforms previous work pre-trained on orders-of-magnitude larger datasets.
Revealing Single Frame Bias for Video-and-Language Learning
- Computer Science, ArXiv
- 2022
This work shows the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training.
When can I Speak? Predicting initiation points for spoken dialogue agents
- Computer Science, SIGDIAL
- 2022
This work predicts the lead-time to initiation using prosodic features from a pre-trained speech representation model (wav2vec 1.0) operating on user audio and word features from a GPT-2 model operating on incremental transcriptions, and finds that the method outperforms features from prior work on both metrics and vastly outperforms the common approach of waiting for 700 ms of silence.
QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries
- Computer Science, NeurIPS
- 2021
A transformer encoder-decoder model is presented that views moment retrieval as a direct set prediction problem, taking extracted video and query representations as inputs and predicting moment coordinates and saliency scores end-to-end, and it is shown to perform competitively with well-engineered architectures.
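The QVHighlights entry above describes a DETR-style formulation in which moments are predicted as a set directly from video and query features. The sketch below is a rough, hedged illustration of that idea in PyTorch, not the authors' model or code; the module names, dimensions, query count, and output heads are all assumptions.

```python
import torch
import torch.nn as nn

class MomentSetPredictor(nn.Module):
    def __init__(self, d_model=256, num_queries=10, nhead=8, num_layers=2):
        super().__init__()
        # Transformer encoder-decoder over concatenated video-clip and query-text tokens.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        # Learned "moment queries": each one can decode into a candidate moment.
        self.moment_queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.span_head = nn.Linear(d_model, 2)      # normalized (center, width) per query
        self.saliency_head = nn.Linear(d_model, 1)  # saliency score per video-clip token

    def forward(self, clip_feats, text_feats):
        # clip_feats: (B, T, d) precomputed video features; text_feats: (B, L, d) query features.
        src = torch.cat([clip_feats, text_feats], dim=1)
        memory = self.transformer.encoder(src)
        tgt = self.moment_queries.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        decoded = self.transformer.decoder(tgt, memory)
        spans = torch.sigmoid(self.span_head(decoded))                               # (B, Q, 2)
        saliency = self.saliency_head(memory[:, : clip_feats.size(1)]).squeeze(-1)   # (B, T)
        return spans, saliency
```

The design choice mirrored here is that learned moment queries let the decoder propose spans directly, without anchors or sliding windows, while a per-clip head on the encoder output supplies the saliency scores described in the summary.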
References
Showing 1-10 of 65 references
Violin: A Large-Scale Dataset for Video-and-Language Inference
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video.
DeepStory: Video Story QA by Deep Embedded Memory Networks
- Computer Science, IJCAI
- 2017
A video-story learning model, Deep Embedded Memory Networks (DEMN), is proposed to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data, and it is shown to outperform other QA models.
TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
This paper proposes three new tasks designed specifically for video VQA, which require spatio-temporal reasoning from videos to answer questions correctly, and introduces a new large-scale dataset for video VQA, named TGIF-QA, that extends existing VQA work with these new tasks.
Anticipating Visual Representations from Unlabeled Video
- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016
This work presents a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects, and applies recognition algorithms to the predicted representations to anticipate objects and actions.
Localizing Moments in Video with Natural Language
- Computer Science, 2017 IEEE International Conference on Computer Vision (ICCV)
- 2017
The Moment Context Network (MCN) is proposed which effectively localizes natural language queries in videos by integrating local and global video features over time and outperforms several baseline methods.
TVQA: Localized, Compositional Video Question Answering
- Computer Science, EMNLP
- 2018
This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of the new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task.
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
Generating Videos with Scene Dynamics
- Computer Science, NIPS
- 2016
A generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background is proposed, and it can generate tiny videos up to a second long at full frame rate better than simple baselines.
Visual Commonsense Graphs: Reasoning about the Dynamic Context of a Still Image
- Computer Science, ArXiv
- 2020
This work proposes VisualComet, a novel framework of visual commonsense reasoning tasks to predict events that might have happened before, events that might happen next, and the intents of the people at present; strong baseline performances are established on this task, demonstrating that integration between visual and textual commonsense reasoning is key and wins over non-integrative alternatives.
Oops! Predicting Unintentional Action in Video
- Computer Science, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
A dataset of in-the-wild videos of unintentional action is presented, along with a suite of tasks for recognizing, localizing, and anticipating its onset; a supervised neural network is trained as a baseline and its performance is compared to human consistency on the tasks.