Sequence-Based Multi-Lingual Low Resource Speech Recognition

Siddharth Dalmia, Ramon Sanabria, Florian Metze, Alan W. Black
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Techniques for multi-lingual and cross-lingual speech recognition can help in low-resource scenarios to bootstrap systems and enable analysis of new languages and domains. Key result: it appears beneficial to include large, well-prepared datasets.

Tables from this paper

Phoneme Level Language Models for Sequence Based Low Resource ASR

This paper proposes a phoneme-level language model that can be used multilingually and for cross-lingual adaptation to a target language. It shows that this model performs almost as well as the monolingual models, and that it adapts better to languages not seen during training in a low-resource scenario.
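To illustrate the general idea of a phoneme-level language model (this is a minimal sketch, not the paper's architecture; the toy corpus and smoothing constant are hypothetical), an add-alpha smoothed bigram model over phoneme sequences could look like:

```python
from collections import defaultdict

def train_bigram_lm(sequences, alpha=0.1):
    """Train an add-alpha smoothed bigram model over phoneme sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1.0

    def prob(prev, cur):
        # Add-alpha smoothing so unseen bigrams keep non-zero probability,
        # which matters when adapting to phoneme sequences of a new language.
        total = sum(counts[prev].values())
        return (counts[prev][cur] + alpha) / (total + alpha * len(vocab))

    return prob, vocab

# Toy phoneme corpus (hypothetical); real systems train on phonemized text.
corpus = [["k", "a", "t"], ["k", "a", "r"], ["b", "a", "t"]]
lm, vocab = train_bigram_lm(corpus)
```

A seen transition such as `lm("k", "a")` scores much higher than an unseen one like `lm("k", "t")`, while smoothing keeps the distribution proper over the whole vocabulary.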

Analysis of Multilingual Sequence-to-Sequence speech recognition systems

This paper investigates the application of various multilingual approaches developed in conventional hidden Markov model (HMM) systems to sequence-to-sequence (seq2seq) automatic speech recognition (ASR), and finds multilingual features superior to multilingual models.

Cross-Lingual Self-training to Learn Multilingual Representation for Low-Resource Speech Recognition

A new pre-training framework, Cross-Lingual Self-Training (XLST), is proposed to further improve the effectiveness of multilingual representation learning; it employs moving-average and multi-view data augmentation mechanisms to better generalize the learned representations.

Cross-Lingual Transfer for Speech Processing Using Acoustic Language Similarity

This work proposes a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages and demonstrates the effectiveness of this approach in language family classification, speech recognition, and speech synthesis tasks.
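As a sketch of how similarity-based transfer-pair selection can work in principle (the embeddings and language codes below are made up, not the paper's actual acoustic features), candidate source languages can be ranked by cosine similarity to the target:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_transfer_sources(target_emb, source_embs):
    """Rank candidate source languages, most similar to the target first."""
    return sorted(source_embs,
                  key=lambda lang: cosine(target_emb, source_embs[lang]),
                  reverse=True)

# Hypothetical 3-d acoustic embeddings per candidate source language.
sources = {"es": [0.9, 0.1, 0.2], "de": [0.1, 0.9, 0.3], "it": [0.8, 0.2, 0.1]}
target = [0.85, 0.15, 0.15]  # e.g. an unseen low-resource language
ranking = rank_transfer_sources(target, sources)
```

The top-ranked languages are the ones whose data would be borrowed for transfer; scaling this to hundreds of languages only changes how the embeddings are obtained, not the ranking step.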

Large-Scale End-to-End Multilingual Speech Recognition and Language Identification with Multi-Task Learning

A large-scale end-to-end, language-independent multilingual model for joint automatic speech recognition (ASR) and language identification (LID) is reported; it achieves a word error rate (WER) of 52.8 and a LID accuracy of 93.5 on 42 languages with around 5000 hours of training data.

Cross-lingual adaptation of a CTC-based multilingual acoustic model

Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation

Transfer learning from high-resource languages is known to be an efficient way to improve end-to-end automatic speech recognition (ASR) for low-resource languages. Pre-trained or jointly trained

Leveraging Language ID in Multilingual End-to-End Speech Recognition

This paper introduces a novel technique for inferring the language ID in a streaming fashion using RNN-T, and a novel loss function that pressures the model to identify the language after as few frames as possible.

Zero-Shot Cross-Lingual Phonetic Recognition with External Language Embedding

This paper argues that, in the real world, even an unseen language has metadata: linguists can tell us the language name, its language family and, usually, its phoneme inventory. It shows that it is possible to train a language embedding using only data from language typologies, and that this embedding reduces ASR error rates.
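One simple way to turn typological metadata into a vector (a hedged illustration only; the universal phoneme set and inventories below are hypothetical, and the paper's embedding is learned rather than hand-built) is a binary phoneme-inventory encoding:

```python
def inventory_embedding(inventory, universal_phonemes):
    """Binary embedding: 1.0 if the language's inventory contains the phoneme."""
    return [1.0 if p in inventory else 0.0 for p in universal_phonemes]

# Hypothetical universal phoneme set shared across all languages.
UNIVERSAL = ["p", "b", "t", "d", "k", "g", "s", "z"]

# A seen training language with the full toy inventory...
emb_seen = inventory_embedding({"p", "b", "t", "d", "k", "g", "s", "z"}, UNIVERSAL)
# ...and an unseen language known only through its documented inventory.
emb_unseen = inventory_embedding({"p", "t", "k", "s"}, UNIVERSAL)
```

Because the vector depends only on documented typology, it can be computed for a language with zero transcribed audio, which is what makes zero-shot conditioning of the recognizer possible.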

Adaptive Activation Network for Low Resource Multilingual Speech Recognition

This work introduces an adaptive activation network in the upper layers of the ASR model, applying different activation functions to different languages. Two training approaches are proposed, cross-lingual learning and multilingual learning, which further improve multilingual speech recognition performance.
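The core mechanism can be sketched as selecting a per-language activation for an upper layer (a toy sketch under assumed names; the language-to-activation mapping and weights below are invented, and the paper learns its activations rather than picking from a fixed table):

```python
import math

def relu(x):
    return max(0.0, x)

# Hypothetical mapping from language ID to the activation used in upper layers.
LANG_ACTIVATION = {"en": relu, "sw": math.tanh}

def adaptive_layer(hidden, weights, lang_id):
    """One upper layer whose nonlinearity is chosen by the utterance's language."""
    act = LANG_ACTIVATION[lang_id]
    return [act(sum(w * h for w, h in zip(row, hidden))) for row in weights]

# Identity weights make the effect of the per-language activation visible.
hidden = [1.0, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0]]
out_en = adaptive_layer(hidden, weights, "en")  # ReLU clips the negative unit
out_sw = adaptive_layer(hidden, weights, "sw")  # tanh keeps a signed response
```

The shared weights stay language-independent while only the cheap activation choice differs per language, which is what keeps the approach viable with little target-language data.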

Investigation of multilingual deep neural networks for spoken term detection

STT gains achieved by using multilingual bottleneck features in a Tandem configuration are shown to carry over to keyword search (KWS), and improvements in both STT and KWS were observed by incorporating language questions into the Tandem GMM-HMM decision trees for the training-set languages.

Language Adaptive Multilingual CTC Speech Recognition

It is demonstrated that setups with multilingual phone sets benefit from the addition of Language Feature Vectors (LFVs), and a similar technique using sequence-based neural network acoustic models with a Connectionist Temporal Classification (CTC) loss function is proposed.

EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding

This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
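At the heart of CTC-based systems like Eesen is the many-to-one mapping from per-frame outputs to a label sequence: merge consecutive repeats, then drop blanks. A minimal sketch of that collapse rule (the blank symbol and toy frame outputs are illustrative, not Eesen's actual token set):

```python
def ctc_collapse(frame_labels, blank="-"):
    """CTC mapping: merge consecutive repeated labels, then remove blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return out

# Greedy per-frame outputs for a toy utterance; blanks separate the repeated
# "l" so that "ll" survives the repeat-merging step.
frames = list("--hh-ee-ll-ll-oo")
decoded = ctc_collapse(frames)
```

Eesen's contribution is to compose exactly this token-level output with a WFST that carries the lexicon and language model, so decoding stays a standard graph search.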

Multilingual acoustic models using distributed deep neural networks

Experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total show average relative gains over the monolingual baselines, but the additional gain from jointly training all languages on all data comes at an increased training time of roughly four weeks.

An Investigation of Deep Neural Networks for Multilingual Speech Recognition Training and Adaptation

By combining state-level minimum Bayes risk (sMBR) sequence training with LAT, it is shown that a language adaptively trained IPA-based universal DNN outperforms a monolingually sequence trained model.

The language-independent bottleneck features

This paper presents a novel language-independent bottleneck (BN) feature extraction framework, in which each language is modelled by a separate output layer, while all the hidden layers jointly model the variability of all the source languages.
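The shared-hidden-layers / per-language-output-layers structure can be sketched as follows (a toy forward pass with invented weights and language codes, assuming a single shared layer; the real framework stacks several and trains them jointly):

```python
import math

def linear(x, W):
    """Matrix-vector product: one row of W per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sigmoid_vec(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

# One shared hidden layer (hypothetical weights) feeding the bottleneck...
SHARED = [[[0.5, -0.2], [0.1, 0.3]]]
# ...and a separate linear output head per source language.
HEADS = {"en": [[1.0, 0.0]], "sw": [[0.0, 1.0]]}

def bottleneck_features(x):
    """Language-independent features: the shared stack only, no output head."""
    h = x
    for W in SHARED:
        h = sigmoid_vec(linear(h, W))
    return h

def forward(x, lang):
    """Per-language logits computed on top of the shared bottleneck."""
    return linear(bottleneck_features(x), HEADS[lang])
```

Because `bottleneck_features` never touches a language head, the same feature extractor can be reused unchanged for a new target language; only a fresh head (or downstream model) needs training.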

Language-independent and language-adaptive acoustic modeling for speech recognition

Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR

This work investigates the use of cross-lingual acoustic data to initialise deep neural network (DNN) acoustic models by means of unsupervised restricted Boltzmann machine (RBM) pre-training, and shows that unsupervised pre-training is more crucial for the hybrid setups, particularly with limited amounts of transcribed training data.

Multilingual bottle-neck features and its application for under-resourced languages

This paper shows that the overall performance of a Multilayer Perceptron (MLP) network improves significantly, and proposes a new strategy called the “open target language” MLP to train more flexible models for language adaptation, which is particularly suited to small amounts of training data.

Integrating multilingual articulatory features into speech recognition

This paper shows that using cross- and multilingual detectors to support an HMM based speech recognition system significantly reduces the word error rate.