Multilingual and Crosslingual Speech Recognition Using Phonological-Vector Based Phone Embeddings

  • Chengrui Zhu, Keyu An, Huahuan Zheng, Zhijian Ou
  • Published 11 July 2021
  • Linguistics, Computer Science
  • 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
The use of phonological features (PFs) potentially allows language-specific phones to remain linked in training, which is highly desirable for information sharing in multilingual and crosslingual speech recognition for low-resource languages. A drawback of previous methods that use phonological features is that bottom-up acoustic-to-PF extraction is itself difficult. In this paper, we propose to join phonology-driven phone embedding (top-down) and deep neural…
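A minimal sketch of the top-down idea in the abstract, assuming that each phone is described by a binary phonological-feature vector and that a single shared linear transform maps it to a phone embedding. The feature inventory, the phone table, the matrix `A`, and the inner-product scoring in `phone_scores` are all hypothetical illustrations, since the abstract is truncated before it describes the actual model.

```python
# Hypothetical binary phonological features (illustrative, not the paper's inventory).
FEATURES = ["voiced", "nasal", "labial", "plosive"]

# Hypothetical phone table: one feature vector per phone, in FEATURES order.
PHONES = {
    "p": [0, 0, 1, 1],
    "b": [1, 0, 1, 1],
    "m": [1, 1, 1, 0],
}

# Hypothetical transform (embedding_dim x num_features), shared by all phones.
# In a trained system these weights would be learned; here they are fixed numbers.
A = [
    [0.5, -0.2, 0.1, 0.3],
    [-0.1, 0.4, 0.2, 0.0],
    [0.3, 0.1, -0.3, 0.2],
]


def phone_embedding(pf_vector, transform):
    """Map a phonological-feature vector to an embedding: e = A @ p."""
    return [sum(w * x for w, x in zip(row, pf_vector)) for row in transform]


def phone_scores(acoustic_vector, phones, transform):
    """Score each phone by an inner product with an acoustic vector.

    Assumption: inner-product scoring is used here only for illustration.
    """
    scores = {}
    for phone, pf_vector in phones.items():
        emb = phone_embedding(pf_vector, transform)
        scores[phone] = sum(a * e for a, e in zip(acoustic_vector, emb))
    return scores


if __name__ == "__main__":
    for phone, pf_vector in PHONES.items():
        print(phone, phone_embedding(pf_vector, A))
    print(phone_scores([1.0, 0.0, -1.0], PHONES, A))
```

Because every embedding is computed from the same transform, phones that share features (here "p" and "b" differ only in voicing) share most of their parameters, and a phone absent from training can still be given an embedding from its feature vector alone.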


Hierarchical Softmax for End-to-End Low-resource Multilingual Speech Recognition

This paper assumes that similar units in neighbouring languages share similar term frequencies, forms a Huffman tree to perform multilingual hierarchical softmax decoding, and shows the effectiveness of this method.
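The Huffman-tree construction mentioned in the summary above can be illustrated with the textbook algorithm. This is a generic sketch under stated assumptions, not the cited paper's implementation: the function name `huffman_codes` and the unit frequencies are hypothetical, and the code only shows how frequent units end up with shorter root-to-leaf paths, which is what makes a hierarchical softmax over the tree cheap for common units.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build binary Huffman codes from a {unit: frequency} mapping.

    Each code is the path from the root to the unit's leaf
    ("0" = left branch, "1" = right branch).
    """
    tie = count()  # unique tie-breaker so dicts are never compared
    heap = [(f, next(tie), {unit: ""}) for unit, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees under a new parent node.
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {u: "0" + c for u, c in codes1.items()}
        merged.update({u: "1" + c for u, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


if __name__ == "__main__":
    # Hypothetical unit counts, for illustration only.
    codes = huffman_codes({"a": 5, "b": 2, "c": 1, "d": 1})
    for unit, code in sorted(codes.items()):
        print(unit, code)
```

In a hierarchical softmax, each internal node of such a tree would hold a binary classifier, so a unit's probability is a product of as many binary decisions as its code has bits.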



Common Voice: A Massively-Multilingual Speech Corpus

This work presents speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit and finds an average Character Error Rate improvement for twelve target languages. For most of these languages, these are the first published results on end-to-end Automatic Speech Recognition.

Towards Zero-shot Learning for Automatic Phonemic Transcription

This model is able to recognize unseen phonemes in the target language without any training data and achieves 7.7% better phoneme error rate on average over a standard multilingual model.

Multilingual Training and Cross-lingual Adaptation on CTC-based Acoustic Model

Experiments show that applying LHUC improves the universal phoneme-based CTC system and that the system is extensible to new phonemes during cross-lingual adaptation. Applying dropout during adaptation further improves the system, which achieves performance competitive with Deep Neural Network / Hidden Markov Model (DNN/HMM) systems on limited data.

Multilingual acoustic models using distributed deep neural networks

Experimental results for cross- and multi-lingual network training of eleven Romance languages on 10k hours of data in total show average relative gains over the monolingual baselines, but additional gain from jointly training the languages on all data comes at an increased training time of roughly four weeks.

Multilingual and Crosslingual Speech Recognition

The design of a multilingual speech recognizer is described using an LVCSR dictation database which has been collected under the project GlobalPhone and built on a global phoneme set which can handle five different languages.

CAT: A CTC-CRF based ASR Toolkit Bridging the Hybrid and the End-to-end Approaches towards Data Efficiency and Low Latency

Compared to existing non-modularized E2E models, CAT performs better on limited-scale datasets, demonstrating its data efficiency. A new method called contextualized soft forgetting enables CAT to do streaming ASR without accuracy degradation.

CRF-based Single-stage Acoustic Modeling with CTC Topology

  • Hongyu Xiang, Zhijian Ou
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
In a head-to-head comparison, the CTC-CRF model using simple bidirectional LSTMs consistently outperforms the strong SS-LF-MMI across all three benchmarking datasets, for both mono-phones and mono-chars.

Efficient Neural Architecture Search for End-to-End Speech Recognition Via Straight-Through Gradients

This paper proposes an efficient NAS method via Straight-Through (ST) gradients, called ST-NAS, which uses the loss from SNAS but uses ST to back-propagate gradients through discrete variables to optimize that loss, an approach not revealed in ProxylessNAS.

How Phonotactics Affect Multilingual and Zero-Shot ASR Performance

  • Siyuan Feng, Piotr Żelasko, N. Dehak
  • Linguistics, Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
It is shown that the gain from modeling crosslingual phonotactics is limited and that imposing too strong a model can hurt zero-shot transfer. A multilingual LM is also found to hurt a multilingual ASR system’s performance, so retaining only the target language's phonotactic data in LM training is preferable.

Grapheme-to-Phoneme Transduction for Cross-Language ASR

A measure of the distance between the G2Ps in different languages is proposed, and agglomerative clustering of the LanguageNet languages bears some resemblance to a phylogeographic language family tree.