Differentiable Allophone Graphs for Language-Universal Speech Recognition

Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe
Building language-universal speech recognition systems entails producing phonological units of spoken sound that can be shared across languages. While speech annotations at the language-specific phoneme or surface levels are readily available, annotations at a universal phone level are relatively rare and difficult to produce. In this work, we present a general framework to derive phone-level supervision from only phonemic transcriptions and phone-to-phoneme mappings with learnable weights…
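The core idea in the abstract — deriving phone-level supervision through a phone-to-phoneme mapping with learnable weights — can be sketched as a masked, normalized mapping matrix that marginalizes universal phone posteriors into language-specific phoneme posteriors. The inventories, mask, and numbers below are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical inventories: 4 universal phones, 3 language-specific phonemes.
# mask[i, j] = 1 if phone i is listed as a realization of phoneme j in the
# phone-to-phoneme mapping; the logits on those arcs are the learnable weights.
mask = np.array([[1., 0., 0.],
                 [1., 1., 0.],
                 [0., 1., 0.],
                 [0., 0., 1.]])
weights = np.zeros_like(mask)  # learnable arc logits; zeros = uniform arcs

# Normalize arc weights per phone (rows), masking disallowed phone-phoneme
# pairs so each phone distributes its mass only over its mapped phonemes.
logits = np.where(mask > 0, weights, -np.inf)
arc_probs = softmax(logits, axis=1)  # each row sums to 1

# Frame-level phone posteriors from an acoustic model (made-up numbers here).
phone_post = softmax(np.array([2.0, 0.5, 0.1, -1.0]), axis=0)

# Marginalize through the weighted mapping to get phoneme posteriors, which
# can be trained against phonemic transcriptions; because every step is
# differentiable, gradients reach both the acoustic model and the arc weights.
phoneme_post = phone_post @ arc_probs
```

Since each phone's arc distribution sums to one, `phoneme_post` is itself a valid distribution over the language's phonemes, so standard phoneme-level losses can supervise the shared phone layer.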


Simple and Effective Zero-shot Cross-lingual Phoneme Recognition
This paper extends previous work on zero-shot cross-lingual transfer learning by fine-tuning a multilingually pretrained wav2vec 2.0 model to transcribe unseen languages by mapping phonemes of the training languages to the target language using articulatory features.
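Mapping an unseen target-language phoneme onto a training-language phoneme via articulatory features reduces, in its simplest form, to a nearest-neighbor lookup in feature space. The tiny binary feature vectors below are illustrative assumptions (real systems use much richer inventories), not the paper's data:

```python
# Hypothetical binary articulatory features (voiced, nasal, stop); the
# phoneme set and encodings are made up for illustration only.
features = {
    "p": (0, 0, 1),  # voiceless oral stop
    "b": (1, 0, 1),  # voiced oral stop
    "m": (1, 1, 1),  # voiced nasal stop
    "s": (0, 0, 0),  # voiceless fricative (crudely encoded)
}
train_phonemes = ["p", "b", "s"]  # phonemes seen during training

def map_unseen(phoneme: str) -> str:
    """Map an unseen target-language phoneme to the nearest training
    phoneme by Hamming distance over articulatory features."""
    target = features[phoneme]
    return min(
        train_phonemes,
        key=lambda q: sum(a != b for a, b in zip(features[q], target)),
    )

print(map_unseen("m"))  # prints "b": the nasal is closest to the voiced stop
```

In a zero-shot setting, the recognizer's output over training phonemes can then be relabeled through this mapping to score the unseen language.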


Universal Phone Recognition with a Multilingual Allophone System
This work proposes a joint model of both language-independent phone and language-dependent phoneme distributions that can build a (nearly-)universal phone recognizer that, when combined with PHOIBLE [1], a large, manually curated database of phone inventories, can be customized into 2,000 language-dependent recognizers.
Towards Zero-shot Learning for Automatic Phonemic Transcription
This model is able to recognize unseen phonemes in the target language without any training data and achieves 7.7% better phoneme error rate on average over a standard multilingual model.
Phoneme Level Language Models for Sequence Based Low Resource ASR
This paper proposes a phoneme-level language model that can be used multilingually and for cross-lingual adaptation to a target language and shows that this model performs almost as well as the monolingual models, and is capable of better adaptation to languages not seen during training in a low resource scenario.
AlloVera: A Multilingual Allophone Database
It is shown that a “universal” allophone model, Allosaurus, built with AlloVera, outperforms “universal” phonemic models and language-specific models on a speech-transcription task.
Multilingual phone models for vocabulary-independent speech recognition tasks
Three different methods to develop multilingual phone models for flexible speech recognition tasks are presented, and a large reduction in the number of densities in the multilingual system is observed.
Unsupervised Phonetic and Word Level Discovery for Speech to Speech Translation for Unwritten Languages
The goal is to evaluate the translation performance not only of acoustically derived units but also of discovered sequences or “words” made from these units, with the intuition that such representations will encode more meaning than phones alone.
Multi-lingual phoneme recognition exploiting acoustic-phonetic similarities of sounds
A statistical distance measure is introduced to determine the similarity of sounds, and a new method of modelling multilingual phonemes, usable across a variety of languages, is presented to exploit the acoustic-phonetic similarities between languages.
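The summary does not specify which statistical distance is used; a common choice for comparing phoneme acoustic models is a symmetric Kullback-Leibler divergence between their Gaussian densities. The sketch below is an assumption in that spirit, with made-up one-dimensional models:

```python
import math

def sym_kl_gauss(mu1, var1, mu2, var2):
    """Symmetric KL divergence between two 1-D Gaussians, a simple stand-in
    for a statistical distance between phoneme acoustic models."""
    def kl(m1, v1, m2, v2):
        return 0.5 * (v1 / v2 + (m2 - m1) ** 2 / v2 - 1.0 + math.log(v2 / v1))
    return kl(mu1, var1, mu2, var2) + kl(mu2, var2, mu1, var1)

# Made-up per-language phoneme models: (mean, variance) of one acoustic feature.
de_a = (1.0, 0.5)   # hypothetical German /a/
en_a = (1.1, 0.6)   # hypothetical English /a/
en_i = (4.0, 0.4)   # hypothetical English /i/

# Cross-language phoneme pairs with small distances are candidates for
# sharing a single multilingual model.
assert sym_kl_gauss(*de_a, *en_a) < sym_kl_gauss(*de_a, *en_i)
```

Clustering phonemes whose pairwise distance falls below a threshold is one simple way to turn such a measure into a shared multilingual inventory.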
Sequence-Based Multi-Lingual Low Resource Speech Recognition
It is shown that end-to-end multi-lingual training of sequence models is effective on context-independent models trained using Connectionist Temporal Classification (CTC) loss and can be adapted cross-lingually to an unseen language using just 25% of the target data.
Tusom2021: A Phonetically Transcribed Speech Dataset from an Endangered Language for Universal Phone Recognition Experiments
A publicly available, phonetically transcribed corpus of 2255 utterances in the endangered Tangkhulic language East Tusom (no ISO 639-3 code), a Tibeto-Burman language variety spoken mostly in India, is presented.
Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models
An E2E model containing both English wordpieces and phonemes in the modeling space is proposed, and it is found that the proposed approach performs 16% better than a grapheme-only biasing model, and 8% better than a wordpiece-only biasing model on a foreign place name recognition task, with only slight degradation on regular English tasks.