Corpus ID: 5061155

Grounded Language Learning from Video Described with Sentences

@inproceedings{Yu2013GroundedLL,
  title={Grounded Language Learning from Video Described with Sentences},
  author={Haonan Yu and Jeffrey Mark Siskind},
  booktitle={ACL},
  year={2013}
}
We present a method that learns representations for word meanings from short video clips paired with sentences. Unlike prior work on learning language from symbolic input, our input consists of video of people interacting with multiple complex objects in outdoor environments. Unlike prior computer-vision approaches that learn from videos with verb labels or images with noun labels, our labels are sentences containing nouns, verbs, prepositions, adjectives, and adverbs. The correspondence… 

Citations

Discriminative Training: Learning to Describe Video with Sentences, from Video Described with Sentences
TLDR
The new method is able to automatically determine which words in the sentence correspond to which concepts in the video in a weakly supervised fashion, and outperforms maximum-likelihood (ML) training significantly with smaller training sets because it can exploit negative training labels to better constrain the learning problem.
Learning to Describe Video with Weak Supervision by Exploiting Negative Sentential Information
TLDR
This paper learns to describe video by discriminatively training positive sentential labels against negative ones in a weakly supervised fashion, where the meaning representations of individual words in these labels are learned from whole sentences without any correspondence annotation of what those words denote in the video.
A Compositional Framework for Grounding Language Inference, Generation, and Acquisition in Video
TLDR
It is demonstrated that these models faithfully represent the meanings of sentences and are sensitive to how the roles played by participants, their characteristics, the actions performed, the manner of such actions, and changing spatial relations between participants affect the meaning of a sentence and how it is grounded in video.
Unsupervised Semantic Action Discovery from Video Collections
TLDR
This paper proposes a method for parsing a video into semantic steps in an unsupervised way, capable of providing a semantic "storyline" of the video composed of its objective steps.
Unsupervised Alignment of Natural Language Instructions with Video Segments
TLDR
This work proposes an unsupervised learning algorithm for automatically inferring the mappings between English nouns and corresponding video objects, together with two generative models closely related to the HMM and IBM Model 1 word-alignment models used in statistical machine translation.
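
The word-alignment model referenced above is a standard building block, so a minimal sketch may help make the citation concrete: the toy code below trains IBM Model 1 with EM on hypothetical noun/object pairs. It illustrates only the classic alignment model, not the cited paper's actual generative models; the data, the function name train_ibm1, and the pairing of nouns with detected object labels are all assumptions for illustration.

# Minimal sketch of IBM Model 1 trained with EM (illustrative only).
# "Source" tokens stand in for nouns from sentences; "target" tokens
# stand in for detected video object labels (hypothetical toy data).
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """pairs: list of (source_tokens, target_tokens); returns t[(f, e)] = P(f | e)."""
    src_vocab = {e for src, _ in pairs for e in src}
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Uniform initialization of the translation table.
    t = {(f, e): 1.0 / len(tgt_vocab) for e in src_vocab for f in tgt_vocab}
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)  # normalize over possible alignments
                for e in src:
                    gamma = t[(f, e)] / z        # posterior that f aligns to e
                    count[(f, e)] += gamma
                    total[e] += gamma
        for (f, e) in t:                         # M-step: renormalize per source word
            if total[e] > 0:
                t[(f, e)] = count[(f, e)] / total[e]
    return t

# Toy usage: after a few EM iterations, "chair" concentrates its mass on CHAIR.
pairs = [(["person", "chair"], ["PERSON", "CHAIR"]),
         (["person", "ball"], ["PERSON", "BALL"])]
t = train_ibm1(pairs)
print(sorted(t.items(), key=lambda kv: -kv[1])[:4])
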
Beyond verbs: Understanding actions in videos with text
  • Shujon Naha, Yang Wang
  • Computer Science
    2016 23rd International Conference on Pattern Recognition (ICPR)
  • 2016
TLDR
This work considers the problem of jointly modeling videos and their corresponding textual descriptions (e.g., sentences or phrases) and develops a joint model that links the two modalities.
Learning Visually Grounded and Multilingual Representations
TLDR
A novel computational model of cross-situational word learning is proposed that takes images of natural scenes paired with their descriptions as input and incrementally learns probabilistic associations between words and image features.
Discriminative Unsupervised Alignment of Natural Language Instructions with Corresponding Video Segments
TLDR
This work proposes three latent-variable discriminative models that can incorporate domain knowledge by adding diverse and overlapping features to the unsupervised task of aligning natural language sentences with corresponding video segments.
Unsupervised Alignment of Natural Language with Video
TLDR
By exploiting the temporal ordering constraints between video and associated text, it is possible to automatically align the sentences in the text with the corresponding video frames without any direct human supervision.
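The temporal-ordering idea above can be made concrete with a small example that is not the cited paper's method: given an assumed sentence-by-segment similarity matrix, an order-preserving dynamic program picks one segment per sentence so that segment indices never decrease. The matrix sim, the function monotonic_align, and the toy scores are hypothetical.

# Minimal sketch (illustrative only): order-preserving alignment of sentences
# to video segments by dynamic programming over an assumed similarity matrix.
import math

def monotonic_align(sim):
    """Return one segment index per sentence such that indices never decrease,
    maximizing the summed similarity."""
    n, m = len(sim), len(sim[0])
    best = [[-math.inf] * m for _ in range(n)]  # best[i][j]: best score with sentence i on segment j
    back = [[0] * m for _ in range(n)]          # backpointer to previous sentence's segment
    best[0] = list(sim[0])
    for i in range(1, n):
        run_best, run_arg = -math.inf, 0        # running prefix max enforces j' <= j
        for j in range(m):
            if best[i - 1][j] > run_best:
                run_best, run_arg = best[i - 1][j], j
            best[i][j] = run_best + sim[i][j]
            back[i][j] = run_arg
    j = max(range(m), key=lambda k: best[n - 1][k])  # best final segment
    path = [j]
    for i in range(n - 1, 0, -1):               # trace the alignment backwards
        j = back[i][j]
        path.append(j)
    return path[::-1]

# Toy usage: 3 sentences against 4 video segments.
sim = [[0.9, 0.1, 0.0, 0.0],
       [0.2, 0.8, 0.3, 0.1],
       [0.0, 0.1, 0.2, 0.9]]
print(monotonic_align(sim))  # -> [0, 1, 3]
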
Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild
TLDR
This paper proposes a strategy for generating textual descriptions of videos by using a factor graph to combine visual detections with language statistics, and uses state-of-the-art visual recognition systems to obtain confidences on entities, activities, and scenes present in the video.

References

SHOWING 1-10 OF 27 REFERENCES
Learning visually grounded words and syntax for a scene description task
  • D. Roy
  • Computer Science, Linguistics
    Comput. Speech Lang.
  • 2002
Video In Sentences Out
TLDR
A system that produces sentential descriptions of video: who did what to whom, and where and how they did it, with an approach to event recognition that recovers object tracks, the track-to-role assignments, and changing body posture.
On the Integration of Grounding Language and Learning Objects
TLDR
A multimodal learning system that can ground spoken names of objects in their physical referents and simultaneously learn to recognize those objects from naturally co-occurring multisensory input, incorporating the spatio-temporal and cross-modal constraints of the multimodal input.
A Probabilistic Model of Syntactic and Semantic Acquisition from Child-Directed Utterances and their Meanings
TLDR
This paper presents an incremental probabilistic learner that models the acquisition of syntax and semantics from a corpus of child-directed utterances paired with possible representations of their meanings, while also countering previous criticisms of statistical syntactic learners.
Baby talk: Understanding and generating simple image descriptions
TLDR
A system that automatically generates natural language descriptions from images by exploiting both statistics gleaned from parsing large quantities of text data and recognition algorithms from computer vision, and that is very effective at producing relevant sentences for images.
I2T: Image Parsing to Text Description
TLDR
An image parsing to text description (I2T) framework that generates text descriptions of image and video content based on image understanding and uses automatic methods to parse image/video in specific domains and generate text reports that are useful for real-world applications.
A Maximum-Likelihood Approach to Visual Event Classification
This paper presents a novel framework, based on maximum likelihood, for training models to recognise simple spatial-motion events, such as those described by the verbs pick up, put down, push, pull, …
Collective Generation of Natural Image Descriptions
TLDR
A holistic data-driven approach to image description generation, exploiting the vast amount of (noisy) parallel image data and associated natural language descriptions available on the web to generate novel descriptions for query images.
Learning to sportscast: a test of grounded language acquisition
TLDR
A novel commentator system that learns language from sportscasts of simulated soccer games and uses a novel algorithm, Iterative Generation Strategy Learning (IGSL), for deciding which events to comment on.