Content-Context Factorized Representations for Automated Speech Recognition

@inproceedings{Chan2022ContentContextFR,
  title={Content-Context Factorized Representations for Automated Speech Recognition},
  author={David Chan and Shalini Ghosh},
  booktitle={Interspeech},
  year={2022}
}
Deep neural networks have largely demonstrated their ability to perform automated speech recognition (ASR) by extracting meaningful features from input audio frames. Such features, however, may contain not only information about the spoken language content, but also information about unnecessary context such as background noise and sounds, or speaker identity, accent, and protected attributes. Such information can directly harm generalization performance by introducing spurious…
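The abstract is cut off before the method is described, so the sketch below is only a generic illustration of how content-context factorization is often set up in ASR encoders: the encoder output is split into a content half used for recognition and a context half used for auxiliary prediction, with a gradient-reversal adversary discouraging context information in the content half. All module names, sizes, and losses here are hypothetical and are not claimed to be the paper's.

```python
# Hypothetical sketch (not the paper's implementation): factorize an audio
# encoder's output into a "content" half used for ASR and a "context" half,
# and use a gradient-reversal speaker classifier to push speaker/context
# information out of the content half.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output          # reverse gradients for adversarial training

class ContentContextEncoder(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=32, n_speakers=100):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        half = hidden                                  # each factor gets half of the 2*hidden output
        self.asr_head = nn.Linear(half, vocab)         # content -> token logits (e.g. CTC)
        self.ctx_head = nn.Linear(half, n_speakers)    # context -> speaker/context label
        self.adv_head = nn.Linear(half, n_speakers)    # adversary applied to the content half

    def forward(self, feats):                          # feats: (batch, frames, feat_dim)
        h, _ = self.encoder(feats)                     # (batch, frames, 2*hidden)
        content, context = h.chunk(2, dim=-1)          # factorize the embedding
        asr_logits = self.asr_head(content)
        ctx_logits = self.ctx_head(context.mean(dim=1))
        adv_logits = self.adv_head(GradReverse.apply(content).mean(dim=1))
        return asr_logits, ctx_logits, adv_logits
```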


Self-Supervised Speech Representation Learning: A Review

This review presents approaches for self-supervised speech representation learning and their connection to other research areas, and reviews recent efforts on benchmarking learned representations to extend the application beyond speech recognition.

References


SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
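As a concrete illustration of the masking policy described above, here is a minimal NumPy sketch of SpecAugment's frequency and time masking (time warping omitted); the mask-size parameters are illustrative defaults, not the published LibriSpeech policy.

```python
# Minimal sketch of SpecAugment-style frequency and time masking on a
# log-mel spectrogram; parameter values are illustrative only.
import numpy as np

def spec_augment(log_mel, num_freq_masks=2, max_freq_width=27,
                 num_time_masks=2, max_time_width=100, rng=None):
    """log_mel: (num_mel_bins, num_frames) array; returns a masked copy."""
    rng = rng or np.random.default_rng()
    spec = log_mel.copy()
    n_mels, n_frames = spec.shape
    for _ in range(num_freq_masks):                    # frequency masking
        f = int(rng.integers(0, max_freq_width + 1))
        f0 = int(rng.integers(0, max(1, n_mels - f)))
        spec[f0:f0 + f, :] = 0.0
    for _ in range(num_time_masks):                    # time masking
        t = int(rng.integers(0, min(max_time_width, n_frames) + 1))
        t0 = int(rng.integers(0, max(1, n_frames - t)))
        spec[:, t0:t0 + t] = 0.0
    return spec
```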

Speech recognition with deep recurrent neural networks

This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
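For readers who want the shape of such a model, a rough PyTorch sketch of a deep bidirectional RNN acoustic model is shown below; layer counts, sizes, and the CTC-style output are illustrative, not the paper's exact configuration.

```python
# Rough sketch of a deep bidirectional RNN acoustic model: stacked biLSTM
# layers build higher-level representations while using long-range context
# in both directions; a per-frame linear layer produces logits (e.g. for CTC).
import torch.nn as nn

class DeepBiRNNAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=320, layers=3, vocab=30):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=layers,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab + 1)    # +1 for the CTC blank

    def forward(self, feats):                          # feats: (batch, frames, feat_dim)
        h, _ = self.rnn(feats)
        return self.out(h)                             # (batch, frames, vocab + 1) logits
```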

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

This work introduces Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units.

Disentanglement by Cyclic Reconstruction

This work proposes an original method, combining adversarial feature predictors and cyclic reconstruction, to disentangle these two representations in the single-domain supervised case, and demonstrates the quality of the representations on information retrieval tasks and the generalization benefits induced by sharpened task-specific representations.

Identity Conversion for Emotional Speakers: A Study for Disentanglement of Emotion Style and Speaker Identity

This work proposes an expressive voice conversion framework which can effectively disentangle linguistic content, speaker identity, pitch, and emotional style information, and introduces mutual information losses to reduce the irrelevant information from the disentangled emotion representations.

Multi-Modal Pre-Training for Automated Speech Recognition

This work introduces a novel approach that leverages a self-supervised learning technique based on masked language modeling to compute a global, multi-modal encoding of the environment in which the utterance occurs and uses a new deep-fusion framework to integrate this global context into a traditional ASR method.
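The fusion step can be pictured roughly as follows; this gated-injection formulation is an assumption made for illustration and is not claimed to be the paper's exact deep-fusion architecture.

```python
# Illustrative sketch of fusing a global, multi-modal context embedding into
# every frame of a conventional ASR encoder via a learned gate (assumed
# formulation, not the paper's exact design).
import torch
import torch.nn as nn

class GlobalContextFusion(nn.Module):
    def __init__(self, frame_dim=512, context_dim=768):
        super().__init__()
        self.proj = nn.Linear(context_dim, frame_dim)
        self.gate = nn.Linear(2 * frame_dim, frame_dim)

    def forward(self, frames, context):
        # frames: (batch, frames, frame_dim); context: (batch, context_dim) global embedding
        ctx = self.proj(context).unsqueeze(1).expand_as(frames)
        g = torch.sigmoid(self.gate(torch.cat([frames, ctx], dim=-1)))
        return frames + g * ctx                        # gated injection of global context
```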

Using Multiple Reference Audios and Style Embedding Constraints for Speech Synthesis

The proposed model can improve the speech naturalness and content quality with multiple reference audios and can also outperform the baseline model in ABX preference tests of style similarity.

Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora

It is demonstrated that triplet-loss based embeddings perform better than i-Vectors in acoustic modeling, confirming that triplet-loss embeddings are a more effective speaker representation for ASR.
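The triplet objective referred to here is standard; a minimal PyTorch sketch is shown below, with the margin value chosen purely for illustration. In practice torch.nn.TripletMarginLoss implements the same objective.

```python
# Minimal triplet-loss sketch: pull embeddings of the same scenario/speaker
# (anchor, positive) together and push a different one (negative) away by a margin.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    d_pos = F.pairwise_distance(anchor, positive)   # same scenario/speaker
    d_neg = F.pairwise_distance(anchor, negative)   # different scenario/speaker
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```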

Disentangled Speaker and Language Representations Using Mutual Information Minimization and Domain Adaptation for Cross-Lingual TTS

The proposed method for obtaining disentangled speaker and language representations via mutual information minimization and domain adaptation for cross-lingual text-to-speech (TTS) synthesis significantly improves the naturalness and speaker similarity of both intra-lingual and cross-lingual TTS synthesis.

Hubert: How Much Can a Bad Teacher Benefit ASR Pre-Training?

The Hidden-Unit BERT (HuBERT) model is proposed which utilizes a cheap k-means clustering step to provide aligned target labels for pre-training of a BERT model and allows the pre-training stage to benefit from the consistency of the unsupervised teacher rather than its intrinsic quality.
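The clustering step can be sketched as follows using scikit-learn and librosa purely for illustration; the MFCC feature choice mirrors the first HuBERT iteration, while the cluster count and other details are illustrative, not the published setup.

```python
# Sketch of HuBERT-style target generation: run cheap k-means over frame-level
# acoustic features and use the cluster ids as frame-aligned pseudo-labels
# for masked prediction during pre-training.
import librosa
import numpy as np
from sklearn.cluster import KMeans

def kmeans_targets(wav_paths, n_clusters=100, sr=16000):
    feats = []
    for path in wav_paths:
        audio, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).T   # (frames, 13)
        feats.append(mfcc)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(np.concatenate(feats))
    # per-utterance, frame-aligned cluster ids used as pre-training targets
    return [km.predict(f) for f in feats], km
```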