Learning a Grammar Inducer from Massive Uncurated Instructional Videos

  @inproceedings{
    title={Learning a Grammar Inducer from Massive Uncurated Instructional Videos},
    author={Songyang Zhang and Linfeng Song and Lifeng Jin and Haitao Mi and Kun Xu and Dong Yu and Jiebo Luo},
    booktitle={Conference on Empirical Methods in Natural Language Processing},
  }
Video-aided grammar induction aims to leverage video information to find more accurate syntactic grammars for the accompanying text. While previous work focuses on building systems that induce grammars on text well aligned with video content, we investigate the scenario in which text and video are only in loose correspondence. Such data can be found in abundance online, and the weak correspondence is similar to the indeterminacy problem studied in language acquisition. Furthermore…


Unsupervised Discontinuous Constituency Parsing with Mildly Context-Sensitive Grammars

This work studies grammar induction with mildly context-sensitive grammars for unsupervised discontinuous parsing, using the probabilistic linear context-free rewriting system (LCFRS) formalism. It finds that a large number of nonterminals is beneficial, and therefore uses tensor-decomposition-based rank-space dynamic programming with an embedding-based parameterization of rule probabilities to scale up the number of nonterminals.

Video-aided Unsupervised Grammar Induction

This paper investigates video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video, and proposes a Multi-Modal Compound PCFG model (MMC-PCFG), which outperforms each individual modality and previous state-of-the-art systems on three benchmarks.
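The compound-PCFG family these papers build on (C-PCFG, VC-PCFG, MMC-PCFG) all rests on the inside algorithm for computing sentence marginals under a probabilistic grammar. A minimal dense-chart sketch for a PCFG in Chomsky normal form, with plain probability arrays standing in for the neural (and, in the multi-modal case, video-conditioned) rule parameterization:

```python
import numpy as np

def inside_log_prob(sent, start, unary, binary):
    """Inside algorithm for a PCFG in Chomsky normal form.

    sent:   list of word ids
    start:  id of the start nonterminal
    unary:  (NT, V) array,      unary[A, w]     = P(A -> w)
    binary: (NT, NT, NT) array, binary[A, B, C] = P(A -> B C)
    Returns log P(sent) under the grammar.
    """
    n, NT = len(sent), unary.shape[0]
    # chart[i, j, A] = P(A derives sent[i:j])
    chart = np.zeros((n, n + 1, NT))
    for i, w in enumerate(sent):
        chart[i, i + 1] = unary[:, w]
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                # Sum over rules A -> B C at split point k.
                chart[i, j] += np.einsum(
                    "abc,b,c->a", binary, chart[i, k], chart[k, j])
    return np.log(chart[0, n, start])
```

The dense chart is O(n^3 · NT^3); the rank-space dynamic programming mentioned above exists precisely to make the analogous computation tractable with many nonterminals.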

End-to-End Learning of Visual Representations From Uncurated Instructional Videos

This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos; it outperforms all published self-supervised approaches on these tasks as well as several fully supervised baselines.
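The core idea of MIL-NCE can be sketched in a few lines: instead of contrasting a clip against a single positive caption, the softmax numerator sums over several candidate narrations, which tolerates loose clip-narration alignment. A minimal NumPy illustration of the objective (not the authors' implementation; the `temperature` parameter is an assumption for the sketch):

```python
import numpy as np

def mil_nce_loss(v, positives, negatives, temperature=1.0):
    """MIL-NCE-style loss for one video-clip embedding `v`.

    positives: (n_pos, d) candidate narration embeddings treated as positives
    negatives: (n_neg, d) narration embeddings from other clips
    """
    pos_scores = positives @ v / temperature   # (n_pos,)
    neg_scores = negatives @ v / temperature   # (n_neg,)
    all_scores = np.concatenate([pos_scores, neg_scores])

    def logsumexp(x):
        m = x.max()
        return m + np.log(np.exp(x - m).sum())

    # -log( sum over positives of exp(s) / sum over all candidates of exp(s) )
    return logsumexp(all_scores) - logsumexp(pos_scores)
```

With a single positive this reduces to the standard InfoNCE loss; multiple positives are what make it a multiple-instance objective.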

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.

Towards Automatic Learning of Procedures From Web Instructional Videos

A segment-level recurrent network is proposed for generating procedure segments by modeling the dependencies across segments and it is shown that the proposed model outperforms competitive baselines in procedure segmentation.

Visually Grounded Compound PCFGs

This work studies visually grounded grammar induction and learns a constituency parser from both unlabeled text and its visual groundings, and shows that using an extension of probabilistic context-free grammar model, it can do fully-differentiable end-to-end visually grounded learning.

Grounded PCFG Induction with Images

A comparison between models with and without visual information shows that the grounded models are able to use visual information to propose noun phrases, gather useful information from images for unknown words, and achieve better performance at prepositional-phrase attachment prediction.

Visually Grounded Neural Syntax Acquisition

We present the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without any explicit supervision. The model learns by looking at

Unsupervised Learning of PCFGs with Normalizing Flow

A neural PCFG inducer is presented which employs context embeddings (Peters et al., 2018) in a normalizing flow model, extending PCFG induction to use semantic and morphological information.

On the Role of Supervision in Unsupervised Constituency Parsing

We analyze several recent unsupervised constituency parsing models, which are tuned with respect to the parsing $F_1$ score on the Wall Street Journal (WSJ) development set (1,700 sentences). We

Learning Transferable Visual Models From Natural Language Supervision

It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
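The pre-training task described here is a symmetric contrastive objective over a batch of matched (image, text) pairs. A minimal NumPy sketch of that objective (an illustration of the loss only, not the released CLIP code; the `temperature` value is illustrative):

```python
import numpy as np

def clip_style_loss(image_embs, text_embs, temperature=1.0):
    """Symmetric contrastive loss over an (N, N) similarity matrix.

    Row i of `image_embs` is assumed to be the matched pair of
    row i of `text_embs`; all other rows serve as in-batch negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def cross_entropy(l):
        # Each row's correct column is its own index (the diagonal).
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because every pair in the batch supplies negatives for every other pair, the objective scales naturally to very large noisy datasets of the kind described above.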