Talking Heads: Detecting Humans and Recognizing Their Interactions

@inproceedings{hoai2014talking,
  title={Talking Heads: Detecting Humans and Recognizing Their Interactions},
  author={Minh Hoai and Andrew Zisserman},
  booktitle={2014 IEEE Conference on Computer Vision and Pattern Recognition},
  year={2014}
}
  • Minh Hoai, Andrew Zisserman
  • Published 23 June 2014
  • Computer Science
The objective of this work is to accurately and efficiently detect configurations of one or more people in edited TV material. We make the following contributions: first, we introduce a new learnable, context-aware configuration model for detecting sets of people in TV material that predicts the scale and location of each upper body in the configuration; second, we show that inference of the model can be solved globally and efficiently using dynamic programming, and implement a maximum margin…
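The abstract states that inference over a configuration of people can be solved globally and efficiently with dynamic programming. A minimal sketch of that idea, assuming a chain model where each person selects one candidate (location, scale) hypothesis scored by a unary detection term plus a pairwise compatibility term with the previous person; the function names and score structure here are illustrative, not the paper's actual model:

```python
# Hypothetical chain-DP sketch: pick one candidate per person so that
# the sum of unary scores and pairwise compatibilities is maximized.
def dp_configuration(unary, pairwise):
    """unary[i][k]: score of candidate k for person i.
    pairwise[i][j][k]: compatibility of candidate j for person i-1
    with candidate k for person i (pairwise[0] is unused).
    Returns (best total score, chosen candidate index per person)."""
    n = len(unary)
    # best[i][k]: best score over persons 0..i ending with candidate k
    best = [list(unary[0])]
    back = [[0] * len(unary[0])]
    for i in range(1, n):
        row, ptr = [], []
        for k in range(len(unary[i])):
            j = max(range(len(unary[i - 1])),
                    key=lambda j: best[i - 1][j] + pairwise[i][j][k])
            row.append(best[i - 1][j] + pairwise[i][j][k] + unary[i][k])
            ptr.append(j)
        best.append(row)
        back.append(ptr)
    # backtrack the globally optimal configuration
    k = max(range(len(unary[-1])), key=lambda k: best[-1][k])
    path = [k]
    for i in range(n - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    return max(best[-1]), path[::-1]
```

With n people and K candidates each, this runs in O(nK²), which is the kind of polynomial-time global inference the abstract refers to, in contrast to exhaustive search over all Kⁿ joint configurations.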

The One Where They Reconstructed 3D Humans and Environments in TV Shows

This paper proposes an automatic approach that operates on an entire season of a TV show and aggregates information in 3D: it builds a 3D model of the environment, computes camera information, static 3D scene structure, and body-scale information, and demonstrates how this information acts as rich 3D context.

Following Gaze in Video

An approach for following gaze in video by predicting where a person (in the video) is looking even when the object is in a different frame, using VideoGaze, a new dataset which is used as a benchmark to both train and evaluate models.

Context-Aware CNNs for Person Head Detection

This work leverages person-scene relations and proposes a global CNN model trained to predict the positions and scales of heads directly from the full image via an energy-based model whose potentials are computed within a CNN framework.

Online Localization and Prediction of Actions and Interactions

A person-centric, online approach to the challenging problem of localizing and predicting actions and interactions in videos; it proposes a structural-SVM-based method that operates on short video segments and is trained with the objective that the confidence of an action or interaction increases as time passes in a positive training clip.

Understanding human-human interactions: a survey

A summary of challenges of dealing with the considerable variation in recording settings, the appearance of the people depicted and the performance of their interaction, and directions to overcome the limitations of the current state-of-the-art are outlined.

Pulling Actions out of Context: Explicit Separation for Effective Combination

  • Y. Wang, Minh Hoai
  • Computer Science
    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
A novel approach for training a human action recognizer that can explicitly factorize human actions from co-occurring factors, and deliberately builds one model for human actions and a separate model for all correlated contextual elements.

Spatial-temporal dual-actor CNN for human interaction prediction in video

A new interaction-prediction method that achieves high accuracy in detecting interactions when only a small percentage of the video has been viewed, and gives improvements on standard interaction-recognition datasets including TV Human Interaction, BIT-Interaction, and UT-Interaction.

Following Gaze Across Views

An approach for following gaze across views by predicting where a particular person is looking throughout a scene by building an end-to-end model that solves the following sub-problems: saliency, gaze pose, and geometric relationships between views.

Online, Supervised and Unsupervised Action Localization in Videos

The overall aim of this dissertation is action localization; it presents an efficient approach for localizing actions by learning contextual relations between different video regions during training, and proposes an online approach to the challenging problem of localizing and predicting actions and interactions in videos.

Detecting People Looking at Each Other in Videos

The objective of this work is to determine if people are interacting in TV video by detecting whether they are looking at each other or not. We determine both the temporal period of the interaction…

Structured Learning of Human Interactions in TV Shows

It is shown that inference can be carried out with polynomial complexity in the number of people, and an efficient algorithm is described that is evaluated on a new dataset comprising 300 video clips acquired from 23 different TV shows and on the benchmark UT-Interaction dataset.

Recognizing proxemics in personal photos

This work presents a computational formulation of visual proxemics by attempting to label each pair of people in an image with a subset of physically based “touch codes” by building an articulated model tuned for each touch code.

Pictorial structures revisited: People detection and articulated pose estimation

This paper proposes a generic approach based on the pictorial structures framework, and demonstrates that such a model is equally suitable for both detection and pose estimation tasks, outperforming the state of the art on three recently proposed datasets.

Progressive search space reduction for human pose estimation

An approach that progressively reduces the search space for body parts, to greatly improve the chances that pose estimation will succeed, and an integrated spatio-temporal model covering multiple frames to refine pose estimates from individual frames, with inference using belief propagation.

Weakly Supervised Learning of Interactions between Humans and Objects

An extensive experimental evaluation on the sports action data set from [1], the PASCAL Action 2010 data set [2], and a new human-object interaction data set are presented.

Detection and Tracking of Occluded People

This work observes that typical occlusions are due to overlaps between people and proposes a people detector tailored to various occlusion levels, leveraging the fact that person-person occlusions result in very characteristic appearance patterns that can help to improve detection results.

We Are Family: Joint Pose Estimation of Multiple Persons

A novel multi-person pose estimation framework, which extends pictorial structures (PS) to explicitly model interactions between people and to estimate their poses jointly, resulting in better pose estimates in group photos, where several persons stand nearby and occlude each other.

Cascaded Models for Articulated Pose Estimation

This work proposes to learn a sequence of structured models at different pose resolutions, where coarse models filter the pose space for the next level via their max-marginals, and trains the cascade to prune as much as possible while preserving true poses for the final level pictorial structure model.