Learning a Grammar Inducer from Massive Uncurated Instructional Videos
@inproceedings{Zhang2022LearningAG,
  title     = {Learning a Grammar Inducer from Massive Uncurated Instructional Videos},
  author    = {Songyang Zhang and Linfeng Song and Lifeng Jin and Haitao Mi and Kun Xu and Dong Yu and Jiebo Luo},
  booktitle = {Conference on Empirical Methods in Natural Language Processing},
  year      = {2022}
}
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text. While previous work focuses on building systems for inducing grammars on text that is well-aligned with video content, we investigate the scenario in which text and video are only in loose correspondence. Such data can be found in abundance online, and the weak correspondence is similar to the indeterminacy problem studied in language acquisition. Furthermore…
One Citation
Unsupervised Discontinuous Constituency Parsing with Mildly Context-Sensitive Grammars
- Computer Science · ArXiv
- 2022
This work studies grammar induction with mildly context-sensitive grammars for unsupervised discontinuous parsing using the probabilistic linear context-free rewriting system (LCFRS) formalism, finds that using a large number of nonterminals is beneficial, and thus makes use of tensor decomposition-based rank-space dynamic programming with an embedding-based parameterization of rule probabilities to scale up the number of nonterminals.
61 References
Video-aided Unsupervised Grammar Induction
- Computer Science · NAACL
- 2021
This paper investigates video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video, and proposes a Multi-Modal Compound PCFG model (MMC-PCFG), which outperforms each individual modality and previous state-of-the-art systems on three benchmarks.
End-to-End Learning of Visual Representations From Uncurated Instructional Videos
- Computer Science · 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2020
This work proposes a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos, and outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
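For intuition, the MIL-NCE idea can be written as a softmax over a bag of candidate clip-narration pairs. The sketch below is a simplified, assumed implementation (the `mil_nce_loss` name, tensor shapes, fixed temperature, and within-batch negatives are illustrative, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def mil_nce_loss(video_emb, text_emb, temperature=0.07):
    """Illustrative MIL-NCE-style loss (a sketch, not the reference code).

    video_emb: [B, D]    one embedding per video clip
    text_emb:  [B, K, D] K loosely aligned narration candidates per clip
    """
    B, K, _ = text_emb.shape
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # sim[i, j, k] = similarity between video i and candidate k of video j
    sim = torch.einsum('id,jkd->ijk', video_emb, text_emb) / temperature

    # Positive "bag": the K narrations that accompany video i.
    pos = torch.logsumexp(sim[torch.arange(B), torch.arange(B)], dim=-1)  # [B]
    # Denominator: every narration candidate in the batch.
    denom = torch.logsumexp(sim.reshape(B, -1), dim=-1)                   # [B]

    return (denom - pos).mean()
```

Treating all K candidate narrations of a clip as a positive bag, rather than forcing a single alignment, is what lets such an objective tolerate loosely synchronized speech and video.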
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
- Computer Science · 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
It is demonstrated that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask.
Towards Automatic Learning of Procedures From Web Instructional Videos
- Computer Science · AAAI
- 2018
A segment-level recurrent network is proposed for generating procedure segments by modeling the dependencies across segments, and it is shown that the proposed model outperforms competitive baselines in procedure segmentation.
Visually Grounded Compound PCFGs
- Computer Science · EMNLP
- 2020
This work studies visually grounded grammar induction, learning a constituency parser from both unlabeled text and its visual groundings, and shows that an extension of the probabilistic context-free grammar model enables fully differentiable, end-to-end visually grounded learning.
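As a rough illustration of the grounding signal used by visually grounded PCFGs, the matching term is typically a max-margin loss with in-batch negatives. The sketch below assumes precomputed text and image embeddings and omits the span-marginal weighting used in the actual model; the function name and margin value are placeholders:

```python
import torch
import torch.nn.functional as F

def max_margin_matching_loss(text_emb, image_emb, margin=0.2):
    """Hedged sketch of a text-image matching loss with in-batch negatives."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    scores = text_emb @ image_emb.t()            # [B, B] similarity matrix
    pos = scores.diag().unsqueeze(1)             # matched pairs on the diagonal

    # Hinge in both retrieval directions, ignoring the true pairs themselves.
    cost_text = (margin + scores - pos).clamp(min=0)
    cost_image = (margin + scores - pos.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    return cost_text.masked_fill(mask, 0).sum() + cost_image.masked_fill(mask, 0).sum()
```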
Grounded PCFG Induction with Images
- Computer Science · AACL
- 2020
A comparison between models with and without visual information shows that the grounded models are able to use visual information for proposing noun phrases, gathering useful information from images for unknown words, and achieving better performance at prepositional phrase attachment prediction.
Visually Grounded Neural Syntax Acquisition
- Computer Science · ACL
- 2019
We present the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without any explicit supervision. The model learns by looking at…
Unsupervised Learning of PCFGs with Normalizing Flow
- Computer Science · ACL
- 2019
A neural PCFG inducer is presented that employs context embeddings (Peters et al., 2018) in a normalizing flow model to extend PCFG induction to use semantic and morphological information.
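To make the normalizing-flow component concrete, a single affine coupling layer is enough to show how a flow contributes a tractable log-determinant term to the likelihood. This is a generic toy layer, not the paper's parameterization; the class name and hidden size are assumptions:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Toy affine-coupling flow layer (illustrative only)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        # Maps the first half of the vector to a scale and shift for the second half.
        self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.Tanh(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=-1)   # Jacobian term added to the log-likelihood
        return torch.cat([x1, y2], dim=-1), log_det
```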
On the Role of Supervision in Unsupervised Constituency Parsing
- Computer Science · EMNLP
- 2020
We analyze several recent unsupervised constituency parsing models, which are tuned with respect to the parsing $F_1$ score on the Wall Street Journal (WSJ) development set (1,700 sentences). We…
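The tuning criterion in question is ordinary unlabeled bracketing $F_1$. A minimal sketch (assuming each sentence's constituents are given as sets of (start, end) spans, and ignoring conventions such as discarding trivial spans) is:

```python
def unlabeled_f1(pred_spans, gold_spans):
    """Corpus-level unlabeled bracketing F1 over per-sentence span sets (sketch)."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_spans, gold_spans):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```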
Learning Transferable Visual Models From Natural Language Supervision
- Computer Science · ICML
- 2021
It is demonstrated that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet.
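The caption-image pre-training task mentioned here can be sketched as a symmetric contrastive (InfoNCE-style) loss over in-batch pairs; the fixed temperature and pre-normalized embeddings below are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of matched pairs (sketch)."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # [B, B]
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> caption
    loss_t2i = F.cross_entropy(logits.t(), targets)        # caption -> image
    return (loss_i2t + loss_t2i) / 2
```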