Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization
@article{Jiang2022EgocentricDM,
  title   = {Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization},
  author  = {Hao Jiang and Calvin Murdock and Vamsi Krishna Ithapu},
  journal = {2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year    = {2022},
  pages   = {10534-10542}
}
Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer's head motion may cause motion blur, surrounding people may…
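To make the task concrete, the fragment below is a minimal sketch of the kind of model this setting calls for: multi-channel audio spectrogram features fused with egocentric video features to predict a per-direction voice-activity map. It is not the authors' architecture; the module layout, channel counts, number of microphones, and the 64x64 direction grid are all illustrative assumptions.

```python
# Minimal sketch of multi-channel audio-visual active speaker localization.
# NOT the paper's architecture: layer sizes, microphone count, and the
# 64x64 direction grid are illustrative assumptions.
import torch
import torch.nn as nn

class AVSpeakerLocalizer(nn.Module):
    def __init__(self, n_mics=6, out_h=64, out_w=64):
        super().__init__()
        # Audio branch: convs over multi-channel spectrograms (one channel per mic).
        self.audio_net = nn.Sequential(
            nn.Conv2d(n_mics, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (B, 64)
        )
        # Video branch: conv encoder over an egocentric RGB frame.
        self.video_net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=4, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (B, 64)
        )
        # Fusion head predicts a per-direction voice-activity heat map.
        self.head = nn.Sequential(
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, out_h * out_w),
        )
        self.out_h, self.out_w = out_h, out_w

    def forward(self, spec, frame):
        # spec:  (B, n_mics, freq, time) multi-channel audio spectrograms
        # frame: (B, 3, H, W) egocentric video frame
        feat = torch.cat([self.audio_net(spec), self.video_net(frame)], dim=1)
        logits = self.head(feat).view(-1, 1, self.out_h, self.out_w)
        return torch.sigmoid(logits)  # per-direction speaking probability

model = AVSpeakerLocalizer()
heatmap = model(torch.randn(2, 6, 128, 100), torch.randn(2, 3, 256, 256))
print(heatmap.shape)  # torch.Size([2, 1, 64, 64])
```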
3 Citations
Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations
- Computer Science, ArXiv
- 2023
An audio-visual deep reinforcement learning approach that works with a shared scene mapper to selectively turn the camera on and efficiently chart out the space, achieving an excellent cost-accuracy tradeoff.
Few-Shot Audio-Visual Learning of Environment Acoustics
- Computer Science, ArXiv
- 2022
A transformer-based method is introduced that uses self-attention to build a rich acoustic context and then predicts the RIRs of arbitrary query source-receiver locations through cross-attention; the method is shown to successfully generate arbitrary RIRs, outperforming state-of-the-art methods and, in a major departure from traditional approaches, generalizing to novel environments in a few-shot manner.
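The described pipeline (self-attention over observed context, then cross-attention from a query source-receiver location) could look roughly like the hedged sketch below; the feature dimensions, context length, RIR length, and module layout are assumptions, not the paper's settings.

```python
# Hedged sketch of context self-attention + query cross-attention for RIR
# prediction; all dimensions and the module layout are illustrative assumptions.
import torch
import torch.nn as nn

class RIRPredictor(nn.Module):
    def __init__(self, d_model=128, rir_len=4096):
        super().__init__()
        # Self-attention builds an acoustic context from observed
        # audio-visual observation embeddings (tokens).
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # A query source-receiver location attends into that context.
        self.query_embed = nn.Linear(6, d_model)       # (src_xyz, rcv_xyz) -> token
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.rir_head = nn.Linear(d_model, rir_len)    # decode a room impulse response

    def forward(self, context_tokens, query_pose):
        # context_tokens: (B, N, d_model) embeddings of observed clips
        # query_pose:     (B, 6) arbitrary source/receiver positions
        ctx = self.context_encoder(context_tokens)
        q = self.query_embed(query_pose).unsqueeze(1)   # (B, 1, d_model)
        fused, _ = self.cross_attn(q, ctx, ctx)          # cross-attention
        return self.rir_head(fused.squeeze(1))           # (B, rir_len)

model = RIRPredictor()
rir = model(torch.randn(2, 8, 128), torch.randn(2, 6))
print(rir.shape)  # torch.Size([2, 4096])
```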
Novel-View Acoustic Synthesis
- Computer Science, ArXiv
- 2023
This work proposes a neural rendering approach, the Visually-Guided Acoustic Synthesis (ViGAS) network, which learns to synthesize the sound at an arbitrary point in space by analyzing the input audio-visual cues.
References
Showing 1-10 of 28 references
VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
This work proposes to leverage the speaker's face appearance, learned from unlabeled video, as an additional prior for isolating the vocal qualities that speaker is likely to produce, and yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
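Conceptually, conditioning separation on a face embedding can be sketched as below: a toy mask-prediction network under assumed shapes and layers, not the VisualVoice model itself.

```python
# Toy sketch of face-conditioned speech separation: a face embedding is tiled
# over time and concatenated with the mixture spectrogram to predict a soft mask.
# Shapes and layers are assumptions, not the VisualVoice architecture.
import torch
import torch.nn as nn

class FaceConditionedSeparator(nn.Module):
    def __init__(self, face_dim=512, freq_bins=257):
        super().__init__()
        self.face_proj = nn.Linear(face_dim, freq_bins)
        self.mask_net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, mix_spec, face_emb):
        # mix_spec: (B, freq, time) magnitude spectrogram of the mixture
        # face_emb: (B, face_dim) appearance embedding of the target speaker
        face_map = self.face_proj(face_emb).unsqueeze(-1)        # (B, freq, 1)
        face_map = face_map.expand(-1, -1, mix_spec.shape[-1])   # tile over time
        x = torch.stack([mix_spec, face_map], dim=1)             # (B, 2, freq, time)
        mask = torch.sigmoid(self.mask_net(x)).squeeze(1)        # soft ratio mask
        return mask * mix_spec                                   # estimated target speech

sep = FaceConditionedSeparator()
est = sep(torch.rand(2, 257, 200), torch.randn(2, 512))
print(est.shape)  # torch.Size([2, 257, 200])
```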
Self-Supervised Moving Vehicle Tracking With Stereo Sound
- Computer Science, 2019 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2019
This work proposes a system that can leverage unlabeled audiovisual data to learn to localize objects (moving vehicles) in a visual reference frame, purely using stereo sound at inference time, and demonstrates that the proposed approach outperforms several baseline approaches.
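The underlying cross-modal distillation idea, where a vision-based teacher supervises an audio-only student that sees stereo spectrograms, can be sketched as follows; the network shapes, the bounding-box target, and the MSE objective are assumptions rather than the paper's exact formulation.

```python
# Sketch of cross-modal distillation: an audio-only student regresses the
# vehicle location predicted by a pretrained vision teacher, so that only
# stereo sound is needed at inference time. Shapes and loss are assumptions.
import torch
import torch.nn as nn

class StereoAudioLocalizer(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),   # 2 = left/right channels
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 4),                                      # (x, y, w, h) box in image frame
        )

    def forward(self, stereo_spec):
        return self.net(stereo_spec)

student = StereoAudioLocalizer()
stereo_spec = torch.randn(8, 2, 128, 64)     # batch of stereo spectrograms
teacher_boxes = torch.rand(8, 4)             # pseudo-labels from a vision-based tracker
loss = nn.functional.mse_loss(student(stereo_spec), teacher_boxes)
loss.backward()                              # train the audio student on pseudo-labels
```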
SoundSpaces: Audio-Visual Navigation in 3D Environments
- Computer Science, ECCV
- 2020
This work proposes a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to discover elements of the geometry of the physical space indicated by the reverberating audio and detect and follow sound-emitting targets.
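A minimal sketch of such a multi-modal policy network is given below: separate encoders for the egocentric RGB frame and the binaural spectrogram feed a shared actor-critic head over discrete navigation actions. The encoder sizes and the action set are assumptions, not the SoundSpaces agent.

```python
# Sketch of a multi-modal policy for audio-visual navigation; sizes and the
# action set are illustrative assumptions.
import torch
import torch.nn as nn

class AudioVisualPolicy(nn.Module):
    def __init__(self, n_actions=4):   # e.g. forward, turn-left, turn-right, stop
        super().__init__()
        self.rgb_enc = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.audio_enc = nn.Sequential(
            nn.Conv2d(2, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.policy = nn.Linear(128, n_actions)   # action logits
        self.value = nn.Linear(128, 1)            # state value for actor-critic training

    def forward(self, rgb, spec):
        feat = torch.cat([self.rgb_enc(rgb), self.audio_enc(spec)], dim=1)
        return self.policy(feat), self.value(feat)

policy = AudioVisualPolicy()
logits, value = policy(torch.randn(1, 3, 128, 128), torch.randn(1, 2, 65, 26))
action = torch.distributions.Categorical(logits=logits).sample()
```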
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
- Computer Science, ECCV
- 2018
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
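The self-supervised pretext task (predict whether video and audio are temporally aligned) can be illustrated with the short sketch below; the feature extractors are stand-ins, and negatives are formed here by mismatching clips within a batch rather than the paper's exact shifting scheme.

```python
# Sketch of the temporal-alignment pretext task: a fused audio-visual classifier
# is trained to tell aligned pairs from misaligned ones, with no labels needed.
# Feature extractors are stand-ins; shapes and the negative-sampling scheme are assumptions.
import torch
import torch.nn as nn

video_feat = torch.randn(16, 512)   # placeholder per-clip video features
audio_feat = torch.randn(16, 512)   # audio features from the SAME clips

# Positives: aligned pairs. Negatives: pair each video with audio from a
# different (mismatched) clip by rolling the batch.
pairs = torch.cat([
    torch.cat([video_feat, audio_feat], dim=1),                  # aligned
    torch.cat([video_feat, audio_feat.roll(1, dims=0)], dim=1),  # misaligned
])
labels = torch.cat([torch.ones(16), torch.zeros(16)])

classifier = nn.Sequential(nn.Linear(1024, 256), nn.ReLU(), nn.Linear(256, 1))
loss = nn.functional.binary_cross_entropy_with_logits(
    classifier(pairs).squeeze(1), labels)
loss.backward()   # the fused representation is learned without any annotations
```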
Binaural Audio-Visual Localization
- Physics, Computer Science, AAAI
- 2021
A novel Binaural Audio-Visual Network (BAVNet) is proposed that concurrently extracts and integrates features from binaural recordings and videos, along with a point-annotation strategy for constructing pixel-level ground truth for network training and performance evaluation.
Learning to Separate Object Sounds by Watching Unlabeled Video
- Computer Science, ECCV
- 2018
This work is the first to learn audio source separation from large-scale "in the wild" videos containing multiple audio sources per video, and obtains state-of-the-art results on visually-aided audio source separation and audio denoising.
Looking to listen at the cocktail party
- Computer Science, ACM Trans. Graph.
- 2018
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
The Right to Talk: An Audio-Visual Transformer Approach
- Computer Science, 2021 IEEE/CVF International Conference on Computer Vision (ICCV)
- 2021
This work introduces a new Audio-Visual Transformer approach to localizing and highlighting the main speaker in both the audio and visual channels of in-the-wild multi-speaker conversation videos, and is one of the first studies able to do so automatically.
Localizing Visual Sounds the Hard Way
- Computer Science, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- 2021
The key technical contribution is showing that, by training the network to explicitly discriminate challenging image fragments, even within images that do contain the sounding object, localization performance can be significantly boosted; this is achieved through a mechanism that automatically mines hard samples and adds them to a contrastive learning formulation.
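A toy version of that objective is sketched below: within-image regions that correlate weakly with the audio are treated as explicit hard negatives alongside the mined positives. The thresholds, shapes, and loss form are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of contrastive sound localization with hard within-image negatives.
# Thresholds, shapes, and the loss form are illustrative assumptions.
import torch
import torch.nn.functional as F

B, D, H, W = 4, 128, 14, 14
# Stand-ins for encoder outputs (requires_grad so the toy loss can backprop).
visual_map = F.normalize(torch.randn(B, D, H, W, requires_grad=True), dim=1)  # per-region visual embeddings
audio_emb = F.normalize(torch.randn(B, D, requires_grad=True), dim=1)         # one embedding per audio clip

# Audio-visual similarity per spatial location: (B, H, W)
sim = torch.einsum('bdhw,bd->bhw', visual_map, audio_emb)

pos_thresh, neg_thresh = 0.6, 0.4
pos_mask = (sim > pos_thresh).float()        # regions treated as the sounding object
hard_neg_mask = (sim < neg_thresh).float()   # hard negatives mined from the SAME image

eps = 1e-8
pos_score = (sim * pos_mask).sum((1, 2)) / (pos_mask.sum((1, 2)) + eps)
neg_score = (sim * hard_neg_mask).sum((1, 2)) / (hard_neg_mask.sum((1, 2)) + eps)

# Simple contrastive objective: pull positives up, push hard negatives down.
loss = (-torch.log(torch.sigmoid(pos_score) + eps)
        - torch.log(1 - torch.sigmoid(neg_score) + eps)).mean()
loss.backward()
```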
Is Someone Speaking?: Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection
- Computer Science, ACM Multimedia
- 2021
TalkNet is a novel framework that makes active speaker detection decisions by taking both short-term and long-term features into consideration, achieving 3.5% and 2.2% improvements over state-of-the-art systems on the AVA-ActiveSpeaker and Columbia ASD datasets, respectively.
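The short-term/long-term idea can be sketched as below: per-frame audio-visual embeddings are fused (short-term), then a recurrent model aggregates long-term temporal context before a per-frame speaking decision. The layer choices and sizes are assumptions, not the TalkNet architecture.

```python
# Sketch of combining short-term audio-visual fusion with a long-term temporal
# model for active speaker detection; layers and sizes are assumptions.
import torch
import torch.nn as nn

class TemporalASD(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.frame_fuse = nn.Linear(2 * d, d)                  # short-term AV fusion per frame
        self.temporal = nn.GRU(d, d, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * d, 1)                        # per-frame speaking logit

    def forward(self, audio_seq, face_seq):
        # audio_seq, face_seq: (B, T, d) per-frame audio / face-track embeddings
        x = torch.relu(self.frame_fuse(torch.cat([audio_seq, face_seq], dim=-1)))
        x, _ = self.temporal(x)                                # long-term temporal context
        return self.head(x).squeeze(-1)                        # (B, T) logits

asd = TemporalASD()
logits = asd(torch.randn(2, 50, 128), torch.randn(2, 50, 128))
print(logits.shape)  # torch.Size([2, 50])
```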