Corpus ID: 245769552

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdel-rahman Mohamed
Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker’s lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech… 
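The masked-prediction objective described above can be sketched in a few lines: mask part of the input stream and score the model's cluster predictions only at the masked frames. The following minimal numpy sketch is illustrative, not the paper's implementation; all function and variable names are invented.

```python
import numpy as np

def masked_cluster_prediction_loss(logits, cluster_ids, mask):
    """Cross-entropy over discovered cluster targets, computed only at
    masked frames -- the masked-prediction objective in miniature.

    logits:      (T, K) model scores over K clusters
    cluster_ids: (T,)   pseudo-labels from an offline clustering step
    mask:        (T,)   True where the input frame was masked
    """
    z = logits - logits.max(axis=1, keepdims=True)                  # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(cluster_ids)), cluster_ids]      # per-frame NLL
    return nll[mask].mean()                                         # masked frames only

# Toy example: 8 frames, 5 clusters, half the frames masked.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))
targets = rng.integers(0, 5, size=8)
mask = np.array([True, False, True, False, True, False, True, False])
loss = masked_cluster_prediction_loss(logits, targets, mask)
```

In the full framework the targets themselves are iteratively refined by re-clustering the learned features; the loss above is only the inner prediction step.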

Robust Self-Supervised Audio-Visual Speech Recognition

This work presents a self-supervised AVSR framework built upon Audio-Visual HuBERT (AV-HuBERT), a state-of-the-art audio-visual speech representation learning model, which outperforms prior state-of-the-art AVSR systems on the largest available AVSR benchmark dataset, LRS3.

Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

The proposed approach explores both the complementarity of audio-visual modalities and long-term context dependency using a transformer-based fusion module and an accessible masking strategy, which can be applied to single-modal tasks, e.g. audio/visual speech recognition and lipreading.

Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

Experimental results suggest that AV-HuBERT generalizes decently to speaker related downstream tasks, and shows that incorporating visual information, even just the lip area, greatly improves the performance and noise robustness, reducing EER by 38% in the clean condition and 75% in noisy conditions.

Predict-and-Update Network: Audio-Visual Speech Recognition Inspired by Human Speech Perception

A Predict-and-Update Network (P&U net) is proposed to simulate a visual cueing mechanism for Audio-Visual Speech Recognition (AVSR), which outperforms the state-of-the-art AVSR methods on both the LRS2-BBC and LRS3-BBC datasets, with relative Word Error Rate (WER) reductions exceeding 10% and 40% under clean and noisy conditions, respectively.

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition

This work proposes to replace the 3D convolution with a video transformer to extract visual features for speech recognition, and achieves state-of-the-art audio-visual recognition performance on LRS3-TED after fine-tuning the model.

A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer

While audio-visual speech models can yield superior performance and robustness compared to audio-only models, their development and adoption are hindered by the lack of labeled and unlabeled data.

Dual-path Attention is All You Need for Audio-Visual Speech Extraction

A new way to fuse audio-visual features by replacing the LSTM in DPRNN with interchunk attention, which incorporates the visual features as an additional feature stream and achieves superior results compared with other time-domain based audio-visual fusion models.

SVTS: Scalable Video-to-Speech Synthesis

This work introduces a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio.

Multi-Modal Pre-Training for Automated Speech Recognition

This work introduces a novel approach that leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs and uses a new deep-fusion framework to integrate this global context into a traditional ASR method.

Self-Supervised Speech Representation Learning: A Review

This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.


Robust Audio-visual Speech Recognition Using Bimodal DFSMN with Multi-condition Training and Dropout Regularization

A bimodal-DFSMN is proposed to jointly learn feature fusion and acoustic modeling, and a per-frame dropout approach is utilized to enhance the robustness of the AVSR system against a missing visual modality.
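The per-frame dropout idea can be illustrated with a small sketch: during training, the visual feature stream is zeroed on randomly chosen frames, so the fused model learns to fall back on audio alone. This is a hypothetical numpy illustration, not the paper's code; feature shapes and fusion by concatenation are assumptions.

```python
import numpy as np

def per_frame_modality_dropout(audio_feats, visual_feats, p_drop=0.3, rng=None):
    """Zero the visual stream on randomly chosen frames before fusion, so the
    model learns to recognise speech even when visual input is unavailable."""
    rng = rng or np.random.default_rng()
    num_frames = visual_feats.shape[0]
    keep = (rng.random(num_frames) >= p_drop)[:, None]   # (T, 1) keep mask
    return np.concatenate([audio_feats, visual_feats * keep], axis=1)

rng = np.random.default_rng(1)
audio = rng.normal(size=(6, 4))   # 6 frames of 4-dim audio features
video = rng.normal(size=(6, 2))   # 6 frames of 2-dim visual features
fused = per_frame_modality_dropout(audio, video, p_drop=0.5, rng=rng)
```

At test time the dropout is disabled (keep every frame), so the same fusion path handles both full and degraded visual input.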

Modality Dropout for Improved Performance-driven Talking Faces

This work uses subjective testing to demonstrate the improvement of audiovisual-driven animation over the equivalent video-only approach, and the improvement in the animation of speech-related facial movements after introducing modality dropout.

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
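The alignment-prediction pretext task reduces to generating positive (aligned) and negative (shifted) audio-video pairs for a binary classifier. A toy numpy sketch of the pair construction follows; the names and shapes are illustrative, not the paper's pipeline.

```python
import numpy as np

def alignment_examples(audio, video, shift):
    """Build one positive and one negative training pair for the
    alignment-prediction pretext task: label 1 for the original (aligned)
    streams, 0 for audio rolled by `shift` frames."""
    positive = (audio, video, 1)
    negative = (np.roll(audio, shift, axis=0), video, 0)
    return [positive, negative]

rng = np.random.default_rng(2)
audio = rng.normal(size=(8, 3))   # 8 frames of audio features
video = rng.normal(size=(8, 5))   # 8 frames of visual features
pairs = alignment_examples(audio, video, shift=4)
```

A network trained to separate the two classes must attend to both streams, which is what yields the fused multisensory representation.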

Recurrent Neural Network Transducer for Audio-Visual Speech Recognition

This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture and significantly improves the state-of-the-art on the LRS3-TED set.

LiRA: Learning Visual Speech Representations from Audio through Self-supervision

This work trains a ResNet+Conformer model to predict acoustic features from unlabelled visual speech and finds that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments.

Large-Scale Visual Speech Recognition

This work designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words.

Discriminative Multi-Modality Speech Recognition

A two-stage speech recognition model that consistently achieves state-of-the-art performance by a significant margin is proposed, which demonstrates the necessity and effectiveness of AE-MSR.

Deep Audio-Visual Speech Recognition

This work compares two models for lip reading, one using a CTC loss, and the other using a sequence-to-sequence loss, built on top of the transformer self-attention architecture.

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

The Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss.
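The offline clustering step can be illustrated with a tiny k-means over frame features, whose cluster ids then serve as the aligned prediction targets. This is a self-contained numpy sketch, not the actual HuBERT pipeline (which clusters MFCC or intermediate-layer features at scale); all names are invented.

```python
import numpy as np

def kmeans_pseudo_labels(feats, k, iters=10):
    """Offline clustering: run k-means on per-frame features and return one
    cluster id per frame, to be used as targets for masked prediction."""
    # Farthest-point initialisation keeps this toy example deterministic.
    centers = feats[:1].copy()
    for _ in range(1, k):
        d = ((feats[:, None, :] - centers[None]) ** 2).sum(-1).min(axis=1)
        centers = np.vstack([centers, feats[d.argmax()]])
    for _ in range(iters):
        d = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)  # (T, k) sq. dists
        labels = d.argmin(axis=1)                               # nearest center
        for j in range(k):
            pts = feats[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)                   # update centroid
    return labels

# Two well-separated blobs of 20 frames each -> two clean pseudo-label groups.
rng = np.random.default_rng(3)
feats = np.concatenate([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels = kmeans_pseudo_labels(feats, k=2)
```

In HuBERT the clustering and masked-prediction steps alternate: better representations yield better clusters, which in turn provide sharper targets for the next training iteration.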

Multi-Task Self-Supervised Learning for Robust Speech Recognition

PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.