Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

@inproceedings{Jiang2022EgocentricDM,
  title={Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization},
  author={Hao Jiang and Calvin Murdock and Vamsi Krishna Ithapu},
  booktitle={2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022},
  pages={10534-10542}
}
Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer's head motion may cause motion blur, surrounding people may… 
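
The full architecture is not detailed in the truncated abstract above. The following is a minimal, hypothetical sketch of the general setup it describes: fusing multi-channel microphone-array audio with egocentric video to predict a spatial voice-activity map. The four-microphone assumption, layer choices, and output grid resolution are illustrative assumptions, not the paper's design.

```python
# Hypothetical sketch (not the paper's architecture): fuse multi-channel
# microphone-array spectrograms with an egocentric video frame to predict
# a coarse spatial voice-activity map around the device wearer.
import torch
import torch.nn as nn

class AVSpeakerLocalizer(nn.Module):
    def __init__(self, n_mics=4, map_h=30, map_w=60):
        super().__init__()
        # Audio branch: 2D convolutions over (mics x time x frequency).
        self.audio_net = nn.Sequential(
            nn.Conv2d(n_mics, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Video branch: convolutions over the RGB frame.
        self.video_net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fusion head: per-cell voice-activity logits on an
        # azimuth x elevation grid (resolution assumed).
        self.head = nn.Sequential(
            nn.Linear(64 + 64, 256), nn.ReLU(),
            nn.Linear(256, map_h * map_w),
        )
        self.map_h, self.map_w = map_h, map_w

    def forward(self, audio_spec, frame):
        a = self.audio_net(audio_spec)                    # (B, 64)
        v = self.video_net(frame)                         # (B, 64)
        logits = self.head(torch.cat([a, v], dim=1))
        return logits.view(-1, self.map_h, self.map_w)    # activity logits

model = AVSpeakerLocalizer()
spec = torch.randn(2, 4, 96, 64)     # (batch, mics, time, freq)
frame = torch.randn(2, 3, 128, 128)  # (batch, RGB, H, W)
activity_map = model(spec, frame)    # (2, 30, 60)
```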

Few-Shot Audio-Visual Learning of Environment Acoustics

A transformer-based method is introduced that uses self-attention to build a rich acoustic context and predicts RIRs for arbitrary query source-receiver locations through cross-attention; the method is demonstrated to successfully generate arbitrary RIRs, outperforming state-of-the-art methods and, in a major departure from traditional methods, generalizing to novel environments in a few-shot manner.
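
As a rough illustration of the mechanism summarized above, and not the cited model's implementation, the sketch below lets query source-receiver locations cross-attend over a small context of observed (location, RIR) pairs to predict new RIRs. The embedding sizes, RIR length, and layer counts are assumed.

```python
# Illustrative sketch only: self-attention builds an acoustic context from a
# few observed (location, RIR) pairs; queries cross-attend to it to predict
# RIRs at novel source-receiver locations. All dimensions are assumptions.
import torch
import torch.nn as nn

class FewShotRIRPredictor(nn.Module):
    def __init__(self, d_model=128, rir_len=512):
        super().__init__()
        self.loc_embed = nn.Linear(6, d_model)        # (src_xyz, recv_xyz)
        self.rir_embed = nn.Linear(rir_len, d_model)  # observed RIR -> token
        self.self_attn = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        self.decode = nn.Linear(d_model, rir_len)

    def forward(self, ctx_locs, ctx_rirs, query_locs):
        # Context tokens from observed (location, RIR) pairs.
        ctx = self.loc_embed(ctx_locs) + self.rir_embed(ctx_rirs)
        ctx = self.self_attn(ctx)                     # rich acoustic context
        q = self.loc_embed(query_locs)                # query locations
        fused, _ = self.cross_attn(q, ctx, ctx)       # queries attend to context
        return self.decode(fused)                     # predicted RIRs

model = FewShotRIRPredictor()
ctx_locs = torch.randn(1, 4, 6)          # 4 observed source-receiver pairs
ctx_rirs = torch.randn(1, 4, 512)        # their measured RIRs
query = torch.randn(1, 2, 6)             # 2 novel source-receiver pairs
pred = model(ctx_locs, ctx_rirs, query)  # (1, 2, 512)
```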

References

Showing 1-10 of 28 references

VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

  • Ruohan Gao, K. Grauman
  • Computer Science
    2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2021
This work proposes to leverage the speaker’s face appearance as an additional prior for isolating the vocal qualities that speaker is likely to produce, learned from unlabeled video, and yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.

Self-Supervised Moving Vehicle Tracking With Stereo Sound

This work proposes a system that can leverage unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time, and demonstrates that the proposed approach outperforms several baseline approaches.

SoundSpaces: Audio-Visual Navigation in 3D Environments

This work proposes a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to discover elements of the geometry of the physical space indicated by the reverberating audio and detect and follow sound-emitting targets.

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
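
The self-supervised objective summarized above lends itself to a compact sketch: a classifier is trained to distinguish temporally aligned video/audio pairs from misaligned ones created by shifting the audio. The encoders below are placeholder layers standing in for the paper's fused multisensory network; shapes and the shift offset are assumptions.

```python
# Minimal sketch of the alignment objective: classify whether a video clip
# and an audio clip are temporally aligned; negatives come from shifting
# the audio. Placeholder encoders, not the paper's architecture.
import torch
import torch.nn as nn

class AlignmentClassifier(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.video_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(d), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(d), nn.ReLU())
        self.cls = nn.Linear(2 * d, 1)  # aligned (1) vs. misaligned (0)

    def forward(self, video, audio):
        z = torch.cat([self.video_enc(video), self.audio_enc(audio)], dim=1)
        return self.cls(z).squeeze(1)

model = AlignmentClassifier()
video = torch.randn(8, 16, 3, 32, 32)             # (batch, frames, C, H, W)
audio = torch.randn(8, 1, 4000)                   # (batch, channel, samples)
shifted = torch.roll(audio, shifts=2000, dims=2)  # misaligned negatives

logits = torch.cat([model(video, audio), model(video, shifted)])
labels = torch.cat([torch.ones(8), torch.zeros(8)])
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
```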

Binaural Audio-Visual Localization

A novel Binaural Audio-Visual Network (BAVNet) is proposed that concurrently extracts and integrates features from binaural recordings and videos, along with a point-annotation strategy for constructing pixel-level ground truth for network training and performance evaluation.

Learning to Separate Object Sounds by Watching Unlabeled Video

This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.

See the Sound, Hear the Pixels

A novel algorithm is proposed that addresses the problem of localizing sound sources in unconstrained videos using efficient fusion and attention mechanisms, and demonstrates a significant performance increase over existing state-of-the-art methods.

Looking to listen at the cocktail party

A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
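
In the spirit of the summary above, and not the cited architecture, a minimal conditioned-separation sketch: a visual embedding of the target speaker modulates a network that predicts a time-frequency mask over the mixture spectrogram. The encoders, embedding size, and spectrogram shape are assumed.

```python
# Illustrative mask-based separation sketch: a face/visual embedding of the
# target speaker conditions the prediction of a spectrogram mask that keeps
# that speaker's speech and suppresses the rest of the mixture.
import torch
import torch.nn as nn

class ConditionedSeparator(nn.Module):
    def __init__(self, n_freq=257, d_face=128):
        super().__init__()
        self.audio_enc = nn.Conv1d(n_freq, 256, kernel_size=3, padding=1)
        self.face_proj = nn.Linear(d_face, 256)
        self.mask_head = nn.Conv1d(256, n_freq, kernel_size=3, padding=1)

    def forward(self, mix_spec, face_emb):
        # mix_spec: (B, n_freq, T) magnitude spectrogram of the mixture
        # face_emb: (B, d_face) embedding of the target speaker's face track
        h = torch.relu(self.audio_enc(mix_spec))
        h = h + self.face_proj(face_emb).unsqueeze(-1)   # broadcast over time
        mask = torch.sigmoid(self.mask_head(h))          # values in [0, 1]
        return mask * mix_spec                           # estimated target speech

sep = ConditionedSeparator()
mix = torch.rand(2, 257, 100)
face = torch.randn(2, 128)
target_spec = sep(mix, face)   # masked spectrogram of the target speaker
```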

The Right to Talk: An Audio-Visual Transformer Approach

This work introduces a new Audio-Visual Transformer approach to localizing and highlighting the main speaker in both the audio and visual channels of multi-speaker conversation videos in the wild, and is one of the first studies able to do so automatically.

Localizing Visual Sounds the Hard Way

The key technical contribution is to show that training the network to explicitly discriminate challenging image fragments, even in images that do contain the sound-emitting object, significantly boosts localization performance; this is achieved by introducing a mechanism that automatically mines hard samples and adds them to a contrastive learning formulation.
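
A hedged sketch of the idea described above, not the authors' exact loss: similarities between an audio embedding and a spatial grid of visual features are thresholded so that low-similarity regions of the same image serve as hard negatives in a contrastive objective, alongside negatives from other images. The thresholds and temperature below are illustrative choices.

```python
# Sketch of hard-sample mining for audio-visual localization: off-object
# regions of the SAME image (low audio-visual similarity) are treated as
# hard negatives in an InfoNCE-style loss. Thresholds/temperature assumed.
import torch
import torch.nn.functional as F

def hard_way_contrastive(vis_feats, aud_emb, pos_thr=0.65, neg_thr=0.4, tau=0.07):
    # vis_feats: (B, D, H, W) spatial visual features; aud_emb: (B, D)
    B, D, H, W = vis_feats.shape
    v = F.normalize(vis_feats.flatten(2), dim=1)           # (B, D, HW)
    a = F.normalize(aud_emb, dim=1)                         # (B, D)
    sim = torch.einsum('bd,bdn->bn', a, v)                  # same-image similarities
    cross = torch.einsum('bd,cdn->bcn', a, v)               # all audio-image pairs

    pos = (sim > pos_thr).float()                           # confident object regions
    hard_neg = (sim < neg_thr).float()                      # same image, off-object
    other = 1.0 - torch.eye(B, device=sim.device)           # other-image negatives

    pos_term = (torch.exp(sim / tau) * pos).sum(1)
    neg_term = (torch.exp(sim / tau) * hard_neg).sum(1) \
             + (torch.exp(cross / tau) * other.unsqueeze(-1)).sum((1, 2))
    return -torch.log(pos_term / (pos_term + neg_term + 1e-8) + 1e-8).mean()

vis = torch.randn(4, 128, 14, 14, requires_grad=True)
aud = torch.randn(4, 128, requires_grad=True)
loss = hard_way_contrastive(vis, aud)
loss.backward()
```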