Beyond Instructional Videos: Probing for More Diverse Visual-Textual Grounding on YouTube

@inproceedings{Hessel2020BeyondIV,
  title={Beyond Instructional Videos: Probing for More Diverse Visual-Textual Grounding on YouTube},
  author={Jack Hessel and Zhenhai Zhu and Bo Pang and Radu Soricut},
  booktitle={EMNLP},
  year={2020}
}
Pretraining from unlabelled web videos has quickly become the de facto means of achieving high performance on many video understanding tasks. Features are learned via prediction of grounded relationships between visual content and automatic speech recognition (ASR) tokens. However, prior pretraining work has been limited to only instructional videos, a domain that, a priori, we expect to be relatively "easy": speakers in instructional videos will often reference the literal objects/actions…
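As a rough illustration of the grounding objective described above (a minimal sketch, not the paper's implementation: the module names, feature dimensions, and binary alignment loss below are all assumptions), the pretraining signal can be cast as predicting whether a video clip and an ASR segment are temporally aligned:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipASRGroundingHead(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, hidden=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.classifier = nn.Linear(hidden, 1)  # aligned vs. not aligned

    def forward(self, clip_feats, asr_feats):
        # clip_feats: (batch, video_dim); asr_feats: (batch, text_dim)
        joint = torch.tanh(self.video_proj(clip_feats) + self.text_proj(asr_feats))
        return self.classifier(joint).squeeze(-1)  # alignment logits

# Positives are co-occurring (clip, ASR) pairs; negatives pair a clip with ASR
# drawn from elsewhere in the corpus.
model = ClipASRGroundingHead()
clip_feats = torch.randn(8, 2048)
asr_feats = torch.randn(8, 768)
labels = torch.randint(0, 2, (8,)).float()
loss = F.binary_cross_entropy_with_logits(model(clip_feats, asr_feats), labels)
loss.backward()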

Citations

MERLOT: Multimodal Neural Script Knowledge Models

TLDR
This work introduces MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech, in an entirely label-free, self-supervised manner, and achieves state-of-the-art performance on 12 different video QA datasets when finetuned.

Grounding ‘Grounding’ in NLP

TLDR
This work investigates the gap between definitions of “grounding” in NLP and Cognitive Science, and presents ways both to create new tasks and to repurpose existing ones to make progress towards a more complete sense of grounding.

References

Showing 1-10 of 33 references

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

TLDR
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
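As a hedged sketch of the kind of joint text-video embedding this summary refers to (the contrastive loss, feature dimensions, and temperature below are illustrative choices, not the paper's exact training objective), both modalities are projected into a shared space so that matching narration/clip pairs score higher than mismatched ones:

import torch
import torch.nn as nn
import torch.nn.functional as F

class JointTextVideoEmbedding(nn.Module):
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

model = JointTextVideoEmbedding()
video_feats = torch.randn(16, 4096)   # e.g. pooled CNN clip features
text_feats = torch.randn(16, 300)     # e.g. pooled word embeddings of the narration
v, t = model(video_feats, text_feats)
sims = v @ t.T / 0.07                 # cosine similarities with a temperature
targets = torch.arange(16)            # the i-th clip matches the i-th narration
loss = F.cross_entropy(sims, targets)
# At retrieval time, captions (or clips) are ranked by the same similarity score.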

Instructional Videos for Unsupervised Harvesting and Learning of Action Examples

TLDR
This work proposes to utilize the large number of instructional videos available online to harvest examples of various actions in an unsupervised fashion, exploiting the timing of action-related terms in the speech transcript to temporally localize actions in the video.
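The transcript-timing heuristic mentioned above can be sketched in a few lines; the action vocabulary and the fixed window size here are hypothetical choices for illustration, not the paper's method:

# Propose a temporal window around each action word mentioned in the ASR transcript.
ACTION_WORDS = {"chop", "pour", "whisk", "fold"}

def propose_segments(asr_words, window=5.0):
    """asr_words: list of (word, start_time_in_seconds) pairs."""
    segments = []
    for word, start in asr_words:
        if word.lower() in ACTION_WORDS:
            # assume the action is performed near the moment its name is spoken
            segments.append((word, max(0.0, start - window), start + window))
    return segments

transcript = [("now", 12.1), ("chop", 12.6), ("the", 12.9), ("onions", 13.2)]
print(propose_segments(transcript))  # proposes a window around the mention of "chop"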

YouMakeup: A Large-Scale Domain-Specific Multimodal Dataset for Fine-Grained Semantic Comprehension

TLDR
This work introduces “YouMakeup”, a large-scale multimodal instructional video dataset to support fine-grained semantic comprehension research in a specific domain, and proposes two groups of tasks, generation and visual question answering, that probe comprehension from different aspects.

Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books

TLDR
To align movies and books, this work proposes a neural sentence embedding trained in an unsupervised way from a large corpus of books, together with a video-text neural embedding for computing similarities between movie clips and sentences in the book.

VideoBERT: A Joint Model for Video and Language Representation Learning

TLDR
This work builds upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively, which can be applied directly to open-vocabulary classification.
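A minimal sketch of the vector-quantization step this summary describes, in which continuous clip features become discrete "visual tokens" by nearest-neighbour assignment against a codebook; the codebook size and how it is fit (e.g., with k-means) are assumptions here:

import torch

def quantize(features, codebook):
    # features: (num_clips, dim); codebook: (num_codes, dim), e.g. k-means centroids
    dists = torch.cdist(features, codebook)      # (num_clips, num_codes)
    return dists.argmin(dim=-1)                  # one discrete token id per clip

codebook = torch.randn(1024, 1024)               # 1024 visual "words"
clip_feats = torch.randn(30, 1024)               # features for 30 video clips
visual_tokens = quantize(clip_feats, codebook)   # ids usable like word-piece ids
# These ids can then be interleaved with ASR word pieces and trained with a
# BERT-style masked-token objective over the joint sequence.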

Cross-Task Weakly Supervised Learning From Instructional Videos

TLDR
The experiments demonstrate that sharing across tasks improves performance, especially when done at the component level and that the component model can parse previously unseen tasks by virtue of its compositionality.

YouTube-8M: A Large-Scale Video Classification Benchmark

TLDR
This work introduces YouTube-8M, the largest multi-label video classification dataset, composed of ~8 million videos (500K hours of video) annotated with a vocabulary of 4800 visual entities, and trains various (modest) classification models on the dataset.

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language

TLDR
A detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches, shows that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on this dataset.
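The soft-attention pooling mentioned in this summary can be sketched as a learned weighted average over per-frame features; fusing the appearance and motion streams by simple concatenation is an illustrative assumption, not necessarily how the cited approach combines them:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim)
        weights = F.softmax(self.score(frame_feats), dim=1)   # (batch, num_frames, 1)
        return (weights * frame_feats).sum(dim=1)             # (batch, feat_dim)

appearance = torch.randn(4, 20, 2048)   # per-frame appearance (2D-CNN) features
motion = torch.randn(4, 20, 1024)       # per-snippet motion features
pooled = AttentionPool(2048 + 1024)(torch.cat([appearance, motion], dim=-1))
# "pooled" is a single vector per video that a captioning RNN can condition on.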

Unsupervised Semantic Parsing of Video Collections

TLDR
The proposed method is capable of providing a semantic "storyline" of the video composed of its objective steps, utilizing both visual and language cues in a joint generative model.

How2: A Large-scale Dataset for Multimodal Language Understanding

TLDR
How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations, is introduced, and integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multi-modal summarization are presented.