Corpus ID: 17135229

A Corpus-Guided Framework for Robotic Visual Perception

  title={A Corpus-Guided Framework for Robotic Visual Perception},
  author={Ching Lik Teo and Yezhou Yang and Hal Daum{\'e} and Cornelia Ferm{\"u}ller and Yiannis Aloimonos},
  booktitle={Language-Action Tools for Cognitive Artificial Agents},
We present a framework that produces sentence-level summarizations of videos containing complex human activities that can be implemented as part of the Robot Perception Control Unit (RPCU). This is done via: 1) detection of pertinent objects in the scene: tools and direct-objects, 2) predicting actions guided by a large lexical corpus and 3) generating the most likely sentence description of the video given the detections. We pursue an active object detection approach by focusing on regions of… Expand
Robots with language: Multi-label visual recognition using NLP
This paper shows a way to learn a language model from large corpus data that exploits these correlations and proposes a general optimization scheme to integrate the language model into the system. Expand
Extracting common sense knowledge from text for robot planning
This work aims to automate the process of acquiring general common sense knowledge of objects, relations, and actions, by extracting such information from large amounts of natural language text, written by humans for human readers. Expand
Integrating multi-purpose natural language understanding, robot's memory, and symbolic planning for task execution in humanoid robots
This study elaborate on a framework that allows a humanoid robot to understand natural language, derive symbolic representations of its sensorimotor experience, generate complex plans according to the current world state, and monitor plan execution. Expand
Grounded spatial symbols for task planning based on experience
It is demonstrated how spatial knowledge can be extracted from these two sources of experience by combining the resulting knowledge in a systematic way yields a solution to the grounding problem which has the potential to substantially decrease the learning effort. Expand
Integration of various concepts and grounding of word meanings using multi-layered multimodal LDA for sentence generation
This study proposes multi-layered multimodal latent Dirichlet allocation (mMLDA) to realize the formation of various concepts, and the integration of those concepts, by robots, and proposes a method to infer which words are originally connected to a concept using mutual information between words and concepts. Expand
Learning word meanings and grammar for verbalization of daily life activities using multilayered multimodal latent Dirichlet allocation and Bayesian hidden Markov models
This paper proposes a novel method that enables a sensing system to verbalize an everyday scene it observes and develops a system to obtain multimodal data from human everyday activities. Expand


A survey of vision-based methods for action representation, segmentation and recognition
This survey focuses on approaches that aim on classification of full-body motions, such as kicking, punching, and waving, and categorizes them according to how they represent the spatial and temporal structure of actions. Expand
Learning realistic human actions from movies
A new method for video classification that builds upon and extends several recent ideas including local space-time features,space-time pyramids and multi-channel non-linear SVMs is presented and shown to improve state-of-the-art results on the standard KTH action dataset. Expand
Action Recognition in Videos: from Motion Capture Labs to the Web
An organizing framework is proposed which puts in evidence the evolution of the area, with techniques moving from heavily constrained motion capture scenarios towards more challenging, realistic, “in the wild” videos. Expand
Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition
This work presents a Bayesian approach which applies spatial and functional constraints on each of the perceptual elements for coherent semantic interpretation and demonstrates the use of such constraints in recognition of actions from static images without using any motion information. Expand
Every Picture Tells a Story: Generating Sentences from Images
A system that can compute a score linking an image to a sentence, which can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. Expand
Movie/Script: Alignment and Parsing of Video and Text Transcription
A weakly supervised algorithm is presented that uses the screenplay and closed captions to parse a movie into a hierarchy of shots and scenes, and the recovered structure is used to improve character naming and retrieval of common actions in several episodes of popular TV series. Expand
Machine Recognition of Human Activities: A Survey
A comprehensive survey of efforts in the past couple of decades to address the problems of representation, recognition, and learning of human activities from video and related applications is presented. Expand
Who's Doing What: Joint Modeling of Names and Verbs for Simultaneous Face and Pose Annotation
A joint model for simultaneously solving the image-caption correspondences and learning visual appearance models for the face and pose classes occurring in the corpus can be used to recognize people and actions in novel images without captions. Expand
Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary
This work shows how to cluster words that individually are difficult to predict into clusters that can be predicted well, and cannot predict the distinction between train and locomotive using the current set of features, but can predict the underlying concept. Expand
A Game-Theoretic Approach to Generating Spatial Descriptions
It is shown that a speaker model that acts optimally with respect to an explicit, embedded listener model substantially outperforms one that is trained to directly generate spatial descriptions. Expand