Sequence-Level Self-Learning with Multiple Hypotheses

@article{Kumatani2020SequenceLevelSW,
  title={Sequence-Level Self-Learning with Multiple Hypotheses},
  author={Ken'ichi Kumatani and Dimitrios Dimitriadis and Yashesh Gaur and Robert Gmyr and Sefik Emre Eskimez and Jinyu Li and Michael Zeng},
  journal={ArXiv},
  year={2020},
  volume={abs/2112.05826}
}
In this work, we develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR). For untranscribed speech data, the hypothesis from an ASR system must be used as a label. However, imperfect ASR results make it difficult for unsupervised learning to consistently improve recognition performance, especially when multiple powerful teacher models are unavailable. In contrast to conventional unsupervised learning…
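Since the abstract above is truncated, the following is only a rough Python (PyTorch) sketch of the general idea it states: an existing seq2seq model decodes an untranscribed utterance into an N-best list, and the hypotheses then serve as weighted pseudo labels for further training. The seed_model.beam_search and student(...) interfaces are assumptions for illustration, not the paper's actual API.

# Rough sketch only: self-learning from N-best pseudo labels. The model
# interfaces (beam_search returning (token_ids, log_score) pairs, and a
# forward pass returning per-token log-probabilities) are assumed here.
import torch
import torch.nn.functional as F

def self_learning_step(seed_model, student, optimizer, feats, n_best=4):
    """One update on a single untranscribed utterance `feats` (time x dim)."""
    student.train()
    optimizer.zero_grad()

    with torch.no_grad():
        # Assumed interface: N-best list of (token_ids, log_score) pairs.
        hyps = seed_model.beam_search(feats.unsqueeze(0), beam_size=n_best)

    # Turn the N-best scores into hypothesis weights.
    weights = F.softmax(torch.tensor([score for _, score in hyps]), dim=0)

    loss = feats.new_zeros(())
    for (tokens, _), w in zip(hyps, weights):
        # Assumed interface: (1, len(tokens), vocab) log-probabilities when
        # teacher-forcing the hypothesis through the decoder.
        log_probs = student(feats.unsqueeze(0), tokens.unsqueeze(0))
        loss = loss + w * F.nll_loss(log_probs.squeeze(0), tokens)

    loss.backward()
    optimizer.step()
    return loss.item()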


References

Showing 1-10 of 39 references

A Teacher-Student Learning Approach for Unsupervised Domain Adaptation of Sequence-Trained ASR Models

TLDR
This work compares a sequence-level KL-divergence objective with another semi-supervised sequence-training method, lattice-free MMI, for unsupervised domain adaptation, and investigates both approaches in multiple scenarios, including adapting from clean to noisy speech, bandwidth mismatch, and channel mismatch.
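As a rough illustration of the teacher-student idea with a KL-divergence objective, the sketch below matches the student's output distribution to the teacher's on unlabeled target-domain data. It works per frame/token for simplicity, whereas the cited work formulates the objective at the sequence level; the temperature knob is an added assumption.

import torch
import torch.nn.functional as F

def ts_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student), averaged over the batch.

    Both tensors have shape (batch, time, vocab). Sketch only; the cited
    work applies the divergence at the sequence level.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)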

Large-Scale Domain Adaptation via Teacher-Student Learning

TLDR
This work proposes an approach to domain adaptation that does not require transcriptions, instead using a corpus of unlabeled parallel data consisting of pairs of samples from the source domain of the well-trained model and the desired target domain.
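A minimal sketch of the parallel-data setup described above, assuming a source/target (e.g., clean/noisy) pair of the same utterance: the teacher scores the source-domain sample it was trained on, and the student is pulled toward the teacher's posteriors while seeing the target-domain sample. Model and data interfaces are illustrative assumptions.

import torch
import torch.nn.functional as F

def adapt_on_parallel_pair(teacher, student, optimizer, src_feats, tgt_feats):
    """src_feats: source-domain features; tgt_feats: the paired target-domain
    features of the same utterance. Sketch only."""
    optimizer.zero_grad()
    with torch.no_grad():
        # Teacher scores the source-domain sample it was trained on.
        teacher_probs = F.softmax(teacher(src_feats), dim=-1)
    # Student scores the paired target-domain sample and mimics the teacher.
    student_logp = F.log_softmax(student(tgt_feats), dim=-1)
    loss = F.kl_div(student_logp, teacher_probs, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()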

Domain Adaptation via Teacher-Student Learning for End-to-End Speech Recognition

TLDR
This work extends teacher-student (T/S) learning to large-scale unsupervised domain adaptation of an attention-based end-to-end (E2E) model through two levels of knowledge transfer: the teacher's token posteriors as soft labels and its one-best predictions as decoder guidance.
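The two transfer signals mentioned above can be pictured roughly as follows: the teacher's one-best hypothesis drives both decoders in place of ground-truth teacher forcing, while the teacher's per-token posteriors act as soft targets for the student. All interfaces below are assumptions for illustration.

import torch
import torch.nn.functional as F

def e2e_ts_step(teacher, student, optimizer, feats):
    """One adaptation step on an untranscribed utterance `feats`. Sketch only."""
    optimizer.zero_grad()
    with torch.no_grad():
        # Assumed interface: the teacher's one-best token sequence.
        one_best = teacher.decode_one_best(feats.unsqueeze(0))
        # Teacher posteriors when teacher-forcing its own one-best hypothesis.
        soft_labels = F.softmax(teacher(feats.unsqueeze(0), one_best), dim=-1)

    # Student is driven by the same one-best tokens (decoder guidance) and
    # trained to match the teacher's token posteriors (soft labels).
    student_logp = F.log_softmax(student(feats.unsqueeze(0), one_best), dim=-1)
    loss = F.kl_div(student_logp, soft_labels, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()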

A Comparison of Sequence-to-Sequence Models for Speech Recognition

TLDR
It is found that the sequence-to-sequence models are competitive with traditional state-of-the-art approaches on dictation test sets, although the baseline, which uses a separate pronunciation and language model, outperforms these models on voice-search test sets.

End-to-end attention-based large vocabulary speech recognition

TLDR
This work investigates an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels.
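For context, a minimal sketch of content-based attention of the kind such models use: at each decoder step, scores between the decoder state and every encoder frame are normalized into alignment weights and used to form a context vector. Class name and dimensions are illustrative, not taken from the cited paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContentAttention(nn.Module):
    """score(s_t, h_j) = v^T tanh(W s_t + U h_j); sketch only."""

    def __init__(self, dec_dim, enc_dim, att_dim):
        super().__init__()
        self.W = nn.Linear(dec_dim, att_dim, bias=False)
        self.U = nn.Linear(enc_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, time, enc_dim)
        scores = self.v(torch.tanh(self.W(dec_state).unsqueeze(1) +
                                   self.U(enc_outputs))).squeeze(-1)
        align = F.softmax(scores, dim=-1)   # alignment over input frames
        context = torch.bmm(align.unsqueeze(1), enc_outputs).squeeze(1)
        return context, align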

Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

TLDR
This work extends state-of-the-art attention-based automatic speech recognition (ASR) systems with synthetic audio generated by a TTS system trained only on the ASR corpus itself, closing the gap to a comparable oracle experiment by more than 50%.

Sequence-level Knowledge Distillation for Model Compression of Attention-based Sequence-to-sequence Speech Recognition

TLDR
This work first uses a pre-trained larger teacher model to generate multiple hypotheses per utterance with beam search, and then trains the student model using these hypotheses generated from the teacher as pseudo labels in place of the original ground truth labels.
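A rough sketch of the offline step this describes: run the teacher's beam search over the unlabeled set and store the N-best hypotheses as pseudo labels that later replace the ground truth in the student's training recipe. The beam_search interface is an assumption.

import torch

def build_pseudo_labeled_set(teacher, unlabeled_feats, beam_size=4):
    """Return a list of (features, [hypothesis token sequences]) pairs.

    Sketch only: `teacher.beam_search` is assumed to return a list of
    (token_ids, log_score) pairs for one utterance.
    """
    pseudo_set = []
    teacher.eval()
    with torch.no_grad():
        for feats in unlabeled_feats:
            hyps = teacher.beam_search(feats.unsqueeze(0), beam_size=beam_size)
            # Keep every hypothesis; each one will be used as a training
            # target for the student in place of the missing transcript.
            pseudo_set.append((feats, [tokens for tokens, _ in hyps]))
    return pseudo_set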

Attention-Based Models for Speech Recognition

TLDR
The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rates.
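The location awareness mentioned here typically feeds the previous alignment through a small convolution and adds it to the attention score, helping the decoder move through the input roughly monotonically. A hedged sketch; kernel size, channel count, and names are illustrative choices, not the cited paper's settings.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAwareAttention(nn.Module):
    """score = v^T tanh(W s_t + U h_j + V f_{t,j}), f_t = conv(prev_align).
    Sketch only."""

    def __init__(self, dec_dim, enc_dim, att_dim, conv_channels=10, kernel=31):
        super().__init__()
        self.W = nn.Linear(dec_dim, att_dim, bias=False)
        self.U = nn.Linear(enc_dim, att_dim, bias=False)
        self.V = nn.Linear(conv_channels, att_dim, bias=False)
        self.loc_conv = nn.Conv1d(1, conv_channels, kernel, padding=kernel // 2)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs, prev_align):
        # prev_align: (batch, time) alignment weights from the previous step.
        loc = self.loc_conv(prev_align.unsqueeze(1)).transpose(1, 2)  # (B, T, C)
        scores = self.v(torch.tanh(self.W(dec_state).unsqueeze(1) +
                                   self.U(enc_outputs) +
                                   self.V(loc))).squeeze(-1)
        align = F.softmax(scores, dim=-1)
        context = torch.bmm(align.unsqueeze(1), enc_outputs).squeeze(1)
        return context, align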

Realizing Petabyte Scale Acoustic Modeling

TLDR
This work utilizes semi-supervised learning (SSL) to learn acoustic models (AM) from a vast firehose of untranscribed audio data, and presents the design and evaluation of a highly scalable and resource-efficient SSL system for AM.

Sequence-Level Knowledge Distillation

TLDR
It is demonstrated that standard knowledge distillation applied to word-level prediction can be effective for NMT, and two novel sequence-level versions of knowledge distillation are introduced that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search.
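The contrast drawn here can be stated compactly: word-level KD matches per-token teacher distributions, while sequence-level KD trains the student on the teacher's beam-search output as if it were the reference, approximating the sequence-level distribution by its mode. A hedged sketch of the two losses, with shapes chosen for illustration:

import torch
import torch.nn.functional as F

def word_level_kd(student_logits, teacher_logits):
    """Cross-entropy between per-token teacher and student distributions.
    Both tensors: (time, vocab). Sketch only."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_logp = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

def sequence_level_kd(student_logp_on_teacher_hyp, teacher_hyp):
    """Negative log-likelihood of the teacher's beam-search output under the
    student. student_logp_on_teacher_hyp: (time, vocab) log-probs when
    teacher-forcing the teacher's hypothesis; teacher_hyp: (time,) token ids.
    Sketch only."""
    return F.nll_loss(student_logp_on_teacher_hyp, teacher_hyp)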