Robust Self-Supervised Audio-Visual Speech Recognition

Bowen Shi, Wei-Ning Hsu, Abdel-rahman Mohamed
Audio-based automatic speech recognition (ASR) degrades significantly in noisy environments and is particularly vulnerable to interfering speech, as the model cannot determine which speaker to transcribe. Audio-visual speech recognition (AVSR) systems improve robustness by complementing the audio stream with visual information, which is invariant to acoustic noise and helps the model focus on the desired speaker. However, previous AVSR work focused solely on the supervised learning setup; hence the…


A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer

While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled data.

Multi-Variant Consistency based Self-supervised Learning for Robust Automatic Speech Recognition

Robust ASR is addressed by introducing a multi-variant consistency (MVC) based SSL method that adapts to different environments and achieves up to 30% relative word error rate reductions over the baseline wav2vec 2.0, one of the most successful SSL methods for ASR.

Masked Autoencoders that Listen

Audio-MAE is a simple extension of image-based Masked Autoencoders to self-supervised representation learning from audio spectrograms, outperforming other recent models that use external supervised pre-training.
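A minimal sketch of the MAE-style spectrogram masking this summary describes; the function name, patch size, and zero-filling (in place of learned mask tokens) are invented for illustration, not Audio-MAE's actual implementation:

```python
import numpy as np

def mask_spectrogram_patches(spec, patch=(4, 4), mask_ratio=0.75, seed=0):
    """Split a (T, F) spectrogram into non-overlapping patches and
    zero out a random subset, MAE-style. Returns the masked
    spectrogram and a boolean keep-mask over the patch grid."""
    rng = np.random.default_rng(seed)
    T, F = spec.shape
    pt, pf = patch
    assert T % pt == 0 and F % pf == 0, "pad the spectrogram to a patch multiple"
    nt, nf = T // pt, F // pf
    n_patches = nt * nf
    n_mask = int(round(mask_ratio * n_patches))
    masked_idx = rng.choice(n_patches, size=n_mask, replace=False)
    out = spec.copy()
    keep = np.ones(n_patches, dtype=bool)
    keep[masked_idx] = False
    for i in masked_idx:
        r, c = divmod(int(i), nf)  # patch index -> (row, col) in the grid
        out[r * pt:(r + 1) * pt, c * pf:(c + 1) * pf] = 0.0
    return out, keep
```

The high mask ratio (75% here) is the design choice that makes the reconstruction task hard enough to force useful representations; the encoder would then see only the kept patches.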

Self-Supervised Speech Representation Learning: A Review

This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.

MM-ALT: A Multimodal Automatic Lyric Transcription System

The MultiModal Automatic Lyric Transcription system (MM-ALT), together with a new dataset, N20EM, which consists of audio recordings, videos of lip movements, and inertial measurement unit (IMU) data of an earbud worn by the performing singer, is proposed.

Learning in Audio-visual Context: A Review, Analysis, and New Perspective

This survey reviews and outlooks the current audio-visual learning from different aspects and hopes it can provide researchers with a better understanding of this area.

Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos

The first method for visual speech-aware perceptual reconstruction of 3D faces is presented, verified through exhaustive objective evaluations on three large-scale datasets, as well as subjective evaluation with two web-based user studies.

Visually-aware Acoustic Event Detection using Heterogeneous Graphs

Through heterogeneous graphs, this work efficiently models intra- and inter-modality relationships at both spatial and temporal scales, and can easily be adapted to different scales of events through relevant hyperparameters.

Using Lip Reading Recognition to Predict Daily Mandarin Conversation

A lip reading recognition model to predict daily Mandarin conversation is proposed, together with a new Daily Mandarin Conversation Lip Reading (DMCLR) dataset consisting of 1,000 videos from 100 daily conversations spoken by ten speakers.

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

This work introduces Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units.
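The masked-prediction objective described above can be sketched roughly as follows; `mask_frames`, `masked_cluster_loss`, and all shapes are invented for this illustration and simplify AV-HuBERT's actual training (which masks spans per modality and substitutes learned embeddings rather than zeros):

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_frames(features, mask_prob=0.3, mask_value=0.0):
    """Randomly mask time frames of a (T, D) feature stream.
    Returns the masked features and a boolean mask (True = masked)."""
    T = features.shape[0]
    mask = rng.random(T) < mask_prob
    masked = features.copy()
    masked[mask] = mask_value
    return masked, mask

def masked_cluster_loss(logits, targets, mask):
    """Cross-entropy over the masked frames only, against discrete
    cluster targets (the 'multimodal hidden units')."""
    # numerically stable log-softmax over the cluster vocabulary
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    per_frame = -log_probs[np.arange(len(targets)), targets]
    return per_frame[mask].mean()
```

Computing the loss only on masked positions is what forces the model to infer the hidden units from surrounding audio-visual context rather than copying the input.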

Discriminative Multi-Modality Speech Recognition

A two-stage speech recognition model is proposed that consistently achieves state-of-the-art performance by a significant margin, demonstrating the necessity and effectiveness of AE-MSR.

Deep Audio-Visual Speech Recognition

This work compares two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss, built on top of the transformer self-attention architecture.
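For the CTC side of that comparison, decoding rests on collapsing a frame-level best path into a label sequence. A minimal greedy (best-path) sketch, with a hypothetical function name and label indices:

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Best-path CTC decoding: take the argmax label per frame,
    merge consecutive repeats, then drop the blank symbol.
    log_probs: (T, C) array of per-frame log-probabilities."""
    best_path = log_probs.argmax(axis=-1)
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != blank:
            decoded.append(int(idx))
        prev = idx
    return decoded
```

Note that a frame path [1, 1, blank, 1] decodes to [1, 1] while [1, 1, 1] decodes to [1]; the blank is what lets CTC emit repeated labels.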

End-To-End Audio-Visual Speech Recognition with Conformers

This work presents a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner and raises the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

A new pre-trained model, WavLM, to solve full-stack downstream speech tasks, which achieves state-of-the-art performance on the SUPERB benchmark, and brings improvements for various speech processing tasks on their representative benchmarks.

Improving Noise Robust Automatic Speech Recognition with Single-Channel Time-Domain Enhancement Network

It is shown that a single-channel time-domain denoising approach can significantly improve ASR performance, providing more than 30% relative word error rate reduction over a strong ASR back-end on the real evaluation data of the single-channel track of the CHiME-4 dataset.

Multiple cameras audio visual speech recognition using active appearance model visual features in car environment

Shape and appearance information are extracted from the jaw and lip regions to enhance performance in vehicle environments, showing more robustness than an acoustic-only speech recognizer across all driving conditions.

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

The Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
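As a toy illustration of such an offline clustering step (not HuBERT's actual pipeline, which clusters MFCC or intermediate-layer features at scale and re-clusters between iterations), each frame can be assigned a k-means cluster index to serve as its pseudo-label:

```python
import numpy as np

def kmeans_targets(features, k=4, iters=10):
    """Run a tiny Lloyd's k-means over (N, D) frame features and return
    each frame's cluster index, to be used as a discrete pseudo-label.
    Centers are seeded from evenly spaced frames -- a simplification."""
    seed_idx = np.linspace(0, len(features) - 1, k).astype(int)
    centers = features[seed_idx].astype(float)
    for _ in range(iters):
        # squared distance of every frame to every center: (N, k)
        d = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d.argmin(axis=1)
        for j in range(k):
            members = features[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return assign
```

The cluster indices play the role of BERT's token vocabulary: because they are computed offline, the target for every masked frame is fixed before training starts.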

The Conversation: Deep Audio-Visual Speech Enhancement

A deep audio-visual speech enhancement network is presented that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal.

Super-Human Performance in Online Low-latency Recognition of Conversational Speech

Results are presented for a system that can achieve super-human performance (a WER of 5.0% on the Switchboard conversational benchmark) at a word-based latency of only 1 second behind a speaker's speech.