Looking to listen at the cocktail party

@article{Ephrat2018LookingTL,
  title={Looking to listen at the cocktail party},
  author={Ariel Ephrat and Inbar Mosseri and Oran Lang and Tali Dekel and Kevin W. Wilson and Avinatan Hassidim and William T. Freeman and Michael Rubinstein},
  journal={ACM Transactions on Graphics (TOG)},
  year={2018},
  volume={37},
  pages={1--11}
}
We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association of the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired… 
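
The abstract describes using visual features to condition the separation of a target speaker from an audio mixture. The PyTorch-style module below is a minimal sketch of that general idea, assuming spectrogram masking conditioned on per-frame face embeddings; the layer sizes, fusion scheme, and names are illustrative assumptions, not the paper's actual architecture.

import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    # Illustrative sketch: visually conditioned time-frequency masking.
    # Layer sizes and the fusion scheme are assumptions, not the paper's design.
    def __init__(self, freq_bins=257, visual_dim=512, hidden=400):
        super().__init__()
        self.audio_enc = nn.Linear(freq_bins, hidden)     # encode each spectrogram frame
        self.visual_enc = nn.Linear(visual_dim, hidden)   # encode per-frame face embeddings
        self.rnn = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, freq_bins), nn.Sigmoid())

    def forward(self, mix_spec, face_emb):
        # mix_spec: (B, T, F) magnitude spectrogram of the mixture
        # face_emb: (B, T, visual_dim) visual features resampled to the audio frame rate
        a = self.audio_enc(mix_spec)
        v = self.visual_enc(face_emb)
        fused, _ = self.rnn(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)   # per-bin mask for the target speaker
        return mask * mix_spec         # estimated target-speaker spectrogram

Training such a sketch would pair mixed spectrograms with the clean target spectrogram, for example with an L1 loss on the masked output.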
Listen, Watch and Understand at the Cocktail Party: Audio-Visual-Contextual Speech Separation
TLDR
This paper tries to solve the speaker-independent speech separation problem with all three audio-visual-contextual modalities for the first time, and shows that a significant performance improvement can be observed with the newly proposed audio-visual-contextual speech separation.
Learning Audio-Visual Dereverberation
TLDR
Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene, achieves state-of-the-art performance and substantially improves over traditional audio-only methods.
My lips are concealed: Audio-visual speech enhancement through obstructions
TLDR
A deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on the speaker's lip movements and/or a representation of their voice, with that voice representation learned on the fly given sufficient unobstructed visual input.
Listening to Sounds of Silence for Speech Denoising
TLDR
A deep learning model for speech denoising, a long-standing challenge in audio analysis arising in numerous applications, based on a key observation about human speech: there is often a short pause between each sentence or word, which exposes not just pure noise but its time-varying features.
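
The key observation in the entry above, that short pauses expose the noise and its time-varying character, can be illustrated without any learning. The NumPy sketch below is a toy stand-in for the paper's learned approach: it treats the quietest frames of a magnitude spectrogram as a noise estimate and subtracts it. The function name, threshold, and spectral floor are hypothetical choices.

import numpy as np

def denoise_via_silence(mag_spec, silence_percentile=10, floor=0.1):
    # mag_spec: (T, F) magnitude spectrogram.
    # Treat the quietest frames as "silence" that exposes the noise profile.
    frame_energy = mag_spec.sum(axis=1)
    threshold = np.percentile(frame_energy, silence_percentile)
    noise_profile = mag_spec[frame_energy <= threshold].mean(axis=0)   # (F,)
    # Simple spectral subtraction with a floor to avoid negative magnitudes.
    return np.maximum(mag_spec - noise_profile, floor * mag_spec)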
Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
TLDR
A unified framework for multi-modal speech separation and enhancement based on synchronous or asynchronous cues is presented and a modern Transformer-based architecture tailored to fuse different modalities to solve the speech separation task in the raw waveform domain is designed.
The Conversation: Deep Audio-Visual Speech Enhancement
TLDR
A deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal.
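
Predicting both the magnitude and the phase of the target, as in the entry above, is often realized as a complex-valued mask applied to the complex STFT of the mixture. The snippet below sketches only that final masking step; it is an assumed formulation, not the paper's code.

import torch

def apply_complex_mask(mix_stft, mask_real, mask_imag):
    # mix_stft: complex STFT of the mixture, e.g. shape (B, F, T).
    # mask_real, mask_imag: real-valued network outputs of the same shape.
    mask = torch.complex(mask_real, mask_imag)
    return mask * mix_stft   # complex STFT of the estimated target speaker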
Audio-Visual Speech Super-Resolution
TLDR
An audio-visual model to perform speech super-resolution at large scale factors (8× and 16×) and a “pseudo-visual network” that precisely synthesizes the visual stream solely from the low-resolution speech input is presented.
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
TLDR
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
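
The self-supervised task described above, deciding whether video frames and audio are temporally aligned, can be sketched as a binary classifier over aligned and time-shifted pairs. The encoders below are simple placeholders rather than the fused multisensory network the summary refers to; all names and shapes are illustrative.

import torch
import torch.nn as nn

class AlignmentClassifier(nn.Module):
    # Placeholder encoders for the alignment pretext task; the actual work
    # learns a fused multisensory representation rather than late fusion.
    def __init__(self, dim=128):
        super().__init__()
        self.video_enc = nn.Sequential(nn.Flatten(1), nn.LazyLinear(dim), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Flatten(1), nn.LazyLinear(dim), nn.ReLU())
        self.cls = nn.Linear(2 * dim, 1)   # logit: aligned (1) vs. misaligned (0)

    def forward(self, video, audio):
        z = torch.cat([self.video_enc(video), self.audio_enc(audio)], dim=-1)
        return self.cls(z).squeeze(-1)

def make_pairs(video, audio):
    # Positives: matching (video, audio). Negatives: audio rolled in time so it
    # no longer lines up with the video.
    shifted = torch.roll(audio, shifts=audio.shape[1] // 2, dims=1)
    pair_video = torch.cat([video, video])
    pair_audio = torch.cat([audio, shifted])
    labels = torch.cat([torch.ones(len(video)), torch.zeros(len(video))])
    return pair_video, pair_audio, labels

Training then amounts to minimizing a binary cross-entropy loss over these pairs.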
Advances in Online Audio-Visual Meeting Transcription
TLDR
A system that generates speaker-annotated transcripts of meetings using a microphone array and a 360-degree camera is described, together with an online audio-visual speaker diarization method that leverages face tracking and identification, sound source localization, speaker identification and, if available, prior speaker information for robustness to various real-world challenges.
...
