Grounded Language Learning from Video Described with Sentences
@inproceedings{Yu2013GroundedLL, title={Grounded Language Learning from Video Described with Sentences}, author={Haonan Yu and Jeffrey Mark Siskind}, booktitle={ACL}, year={2013} }
We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels are sentences containing nouns, verbs, prepositions, adjectives, and adverbs. The correspondence…
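The abstract is truncated here, but the core idea can be illustrated with a deliberately simplified sketch. The paper itself uses a more elaborate formulation; everything below, including the feature labels and the `sentence_score` helper, is a hypothetical illustration. The key point is that the correspondence between words and object tracks is latent, so a sentence is scored against a video by searching over word-to-track assignments.

```python
from itertools import permutations

def sentence_score(words, tracks, word_models):
    """words: nouns in the sentence; tracks: one feature label per detected
    object track; word_models: word -> {feature label: probability}."""
    best = 0.0
    # The word-to-track correspondence is latent: brute-force over injective
    # assignments and keep the best-scoring one.
    for assignment in permutations(range(len(tracks)), len(words)):
        p = 1.0
        for word, t in zip(words, assignment):
            p *= word_models[word].get(tracks[t], 1e-6)  # smoothed likelihood
        best = max(best, p)
    return best

# Toy usage with hypothetical feature labels and hand-set word models:
models = {"person": {"upright": 0.9}, "backpack": {"small-object": 0.8}}
print(sentence_score(["person", "backpack"],
                     ["upright", "small-object"], models))  # 0.72
```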
136 Citations
Discriminative Training: Learning to Describe Video with Sentences, from Video Described with Sentences
- Computer Science · ArXiv
- 2013
The new method automatically determines which words in the sentence correspond to which concepts in the video in a weakly supervised fashion, and outperforms maximum-likelihood (ML) training significantly with smaller training sets because it can exploit negative training labels to better constrain the learning problem.
Learning to Describe Video with Weak Supervision by Exploiting Negative Sentential Information
- Computer Science · AAAI
- 2015
This paper learns to describe video by discriminatively training positive sentential labels against negative ones in a weakly supervised fashion, where the meaning representations of individual words in these labels are learned from whole sentences without any correspondence annotation of what those words denote in the video.
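The discriminative idea in the two entries above can be sketched, under assumptions, as a contrastive objective: push the score of the attested (positive) sentence above the scores of negative sentences for the same video. The `contrastive_loss` helper below is a hypothetical illustration, not the papers' exact objective.

```python
import math

def contrastive_loss(score_pos, scores_neg):
    """Negative log-probability of the positive sentence under a softmax
    over the positive and negative sentence scores; minimizing this is one
    way to exploit negative sentential labels."""
    logZ = math.log(sum(math.exp(s) for s in [score_pos] + scores_neg))
    return logZ - score_pos

# Toy usage: the attested description should outscore the negatives.
print(contrastive_loss(2.0, [0.5, -1.0]))  # smaller is better
```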
A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video
- Linguistics · J. Artif. Intell. Res.
- 2015
It is demonstrated that these models faithfully represent the meanings of sentences and are sensitive to how the roles played by participants, their characteristics, the actions performed, the manner of such actions, and changing spatial relations between participants affect the meaning of a sentence and how it is grounded in video.
Unsupervised Semantic Action Discovery from Video Collections
- Computer Science · ArXiv
- 2016
This paper proposes a method for parsing a video into semantic steps in an unsupervised way, capable of providing a semantic "storyline" of the video composed of its objective steps.
Unsupervised Alignment of Natural Language Instructions with Video Segments
- Computer Science · AAAI
- 2014
An unsupervised learning algorithm is proposed for automatically inferring the mappings between English nouns and corresponding video objects, together with two generative models closely related to the HMM and IBM Model 1 word-alignment models used in statistical machine translation.
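IBM Model 1 is a concrete, well-known algorithm, so a minimal EM sketch of the noun-to-object variant described above may help; the data format and the `ibm1_em` helper are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def ibm1_em(pairs, iters=10):
    """IBM Model 1 style EM over pairs of (sentence nouns, detected object
    labels): learns t(object | noun) with no alignment annotation."""
    t = defaultdict(lambda: 1.0)  # uniform-ish initialization
    for _ in range(iters):
        count = defaultdict(float)
        total = defaultdict(float)
        for nouns, objects in pairs:
            for o in objects:
                norm = sum(t[(o, n)] for n in nouns)
                for n in nouns:
                    c = t[(o, n)] / norm          # expected alignment count
                    count[(o, n)] += c
                    total[n] += c
        for (o, n) in count:                      # M-step: renormalize
            t[(o, n)] = count[(o, n)] / total[n]
    return t

# Toy usage: "cup" co-occurs with the CUP detection across clips.
data = [(["person", "cup"], ["PERSON", "CUP"]),
        (["person", "bowl"], ["PERSON", "BOWL"])]
t = ibm1_em(data)
print(t[("CUP", "cup")], t[("CUP", "person")])
```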
Beyond verbs: Understanding actions in videos with text
- Computer Science · 2016 23rd International Conference on Pattern Recognition (ICPR)
- 2016
This work considers the problem of joint modeling of videos and their corresponding textual descriptions (e.g. sentences or phrases) and develops a joint model that links videos and text.
Learning Visually Grounded and Multilingual Representations
- Computer Science
- 2019
A novel computational model of cross-situational word learning is proposed that takes images of natural scenes paired with their descriptions as input and incrementally learns probabilistic associations between words and image features.
Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments
- Computer Science · NAACL
- 2015
This work proposes three latent-variable discriminative models that incorporate domain knowledge by adding diverse and overlapping features to the unsupervised task of aligning natural-language sentences with their corresponding video segments.
Unsupervised Alignment of Natural Language with Video
- Computer Science
- 2015
By exploiting the temporal ordering constraints between video and associated text, it is possible to automatically align the sentences in the text with the corresponding video frames without any direct human supervision.
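The temporal-ordering idea lends itself to a dynamic-programming sketch: if each sentence must be matched to a later video segment than the previous one, the best monotone assignment can be found exactly. The similarity matrix and `monotone_align` helper below are hypothetical; the paper's actual model may differ.

```python
def monotone_align(sim):
    """sim[i][j]: similarity of sentence i and video segment j (assumed
    given, e.g. from noun/object co-occurrence). Returns the best monotone
    assignment of each sentence to one segment (segments never reordered;
    assumes at least as many segments as sentences)."""
    n, m = len(sim), len(sim[0])
    NEG = float("-inf")
    # dp[i][j]: best total score with sentence i assigned to segment j.
    dp = [[NEG] * m for _ in range(n)]
    back = [[0] * m for _ in range(n)]
    for j in range(m):
        dp[0][j] = sim[0][j]
    for i in range(1, n):
        for j in range(i, m):
            prev = max(range(j), key=lambda k: dp[i - 1][k])
            dp[i][j] = dp[i - 1][prev] + sim[i][j]
            back[i][j] = prev
    # Backtrack from the best final cell to recover the alignment.
    j = max(range(n - 1, m), key=lambda k: dp[n - 1][k])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return list(reversed(path))

# Toy usage: 3 sentences, 4 segments.
sim = [[0.9, 0.1, 0.1, 0.0],
       [0.2, 0.8, 0.3, 0.1],
       [0.0, 0.1, 0.2, 0.9]]
print(monotone_align(sim))  # [0, 1, 3]
```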
Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild
- Computer Science · COLING
- 2014
This paper proposes a strategy for generating textual descriptions of videos by using a factor graph to combine visual detections with language statistics, and uses state-of-the-art visual recognition systems to obtain confidences on entities, activities, and scenes present in the video.
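A minimal sketch of the combination step described above, assuming per-role visual confidences and a language-model plausibility score are already available (the `best_svo` helper and the toy scores are illustrative, not the paper's factor-graph inference):

```python
from itertools import product

def best_svo(vis_conf, lm_score):
    """Choose the (subject, verb, object) triple that maximizes visual
    detection confidence times language-model plausibility."""
    subjects, verbs, objects = vis_conf["subj"], vis_conf["verb"], vis_conf["obj"]
    def score(s, v, o):
        return subjects[s] * verbs[v] * objects[o] * lm_score(s, v, o)
    return max(product(subjects, verbs, objects), key=lambda t: score(*t))

# Toy usage with hypothetical detections and a stub language model.
vis = {"subj": {"person": 0.9, "dog": 0.4},
       "verb": {"ride": 0.6, "eat": 0.5},
       "obj": {"bike": 0.8, "food": 0.3}}
lm = lambda s, v, o: 0.9 if (v, o) in {("ride", "bike"), ("eat", "food")} else 0.1
print(best_svo(vis, lm))  # ('person', 'ride', 'bike')
```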
References (showing 1-10 of 27)
Learning visually grounded words and syntax for a scene description task
- Computer Science, Linguistics · Comput. Speech Lang.
- 2002
Learning to talk about events from narrated video in a construction grammar framework
- Linguistics · Artif. Intell.
- 2005
Video In Sentences Out
- Linguistics · UAI
- 2012
A system that produces sentential descriptions of video: who did what to whom, and where and how they did it, built on an approach to event recognition that recovers object tracks, track-to-role assignments, and changing body posture.
On the Integration of Grounding Language and Learning Objects
- Computer Science · AAAI
- 2004
A multimodal learning system that can ground spoken names of objects in their physical referents and learn to recognize those objects simultaneously from naturally co-occurring multisensory input, incorporating the spatio-temporal and cross-modal constraints of multimodal data.
A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their Meanings
- Linguistics, Computer Science · EACL
- 2012
An incremental probabilistic learner is presented that models the acquisition of syntax and semantics from a corpus of child-directed utterances paired with possible representations of their meanings, while also countering previous criticisms of statistical syntactic learners.
Baby talk: Understanding and generating simple image descriptions
- Computer Science · CVPR 2011
- 2011
A system that automatically generates natural language descriptions of images by exploiting both statistics gleaned from parsing large quantities of text data and recognition algorithms from computer vision, and that proves very effective at producing relevant sentences for images.
I2T: Image Parsing to Text Description
- Computer Science · Proceedings of the IEEE
- 2010
An image-parsing-to-text-description (I2T) framework that generates text descriptions of image and video content based on image understanding, using automatic methods to parse images and video in specific domains and generate text reports useful for real-world applications.
A Maximum-Likelihood Approach to Visual Event Classification
- Computer Science · ECCV
- 1996
This paper presents a novel framework, based on maximum likelihood, for training models to recognise simple spatial-motion events, such as those described by the verbs pick up, put down, push, pull,…
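The maximum-likelihood classification recipe implied here is standard: train one generative event model per verb and label a new video with the verb whose model assigns the highest likelihood. A minimal sketch, assuming discrete-observation HMM event models (the matrices and `forward_loglik` helper are illustrative):

```python
import math

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a discrete observation sequence under an HMM via
    the forward algorithm (pi: initial probs, A: transitions, B: emissions)."""
    alpha = [pi[s] * B[s][obs[0]] for s in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[s] * A[s][t] for s in range(len(pi))) * B[t][o]
                 for t in range(len(pi))]
    return math.log(sum(alpha))

# Toy two-state model for one verb; classification would take the argmax of
# forward_loglik over a dict of per-verb models (all values hypothetical).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.2, 0.8]]
B = [[0.9, 0.1], [0.2, 0.8]]  # two discrete observation symbols
print(forward_loglik([0, 0, 1], pi, A, B))
```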
Collective Generation of Natural Image Descriptions
- Computer Science · ACL
- 2012
A holistic data-driven approach to image description generation, exploiting the vast amount of (noisy) parallel image data and associated natural language descriptions available on the web to generate novel descriptions for query images.
Learning to sportscast: a test of grounded language acquisition
- Computer Science · ICML '08
- 2008
A commentator system that learns language from sportscasts of simulated soccer games, using a novel algorithm, Iterative Generation Strategy Learning (IGSL), to decide which events to comment on.