Language ID Prediction from Speech Using Self-Attentive Pooling

Roman Bedyakin, N. Mikhaylovskiy
This memo describes the NTR-TSU submission for the SIGTYP 2021 Shared Task on predicting language IDs from speech. Spoken Language Identification (LID) is an important step in a multilingual Automated Speech Recognition (ASR) system pipeline. For many low-resource and endangered languages, only single-speaker recordings may be available, creating a need for domain- and speaker-invariant language ID systems. In this memo, we show that a convolutional neural network with a Self-Attentive Pooling layer…
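
As a rough illustration of what a self-attentive pooling layer does in general, the sketch below collapses a variable-length sequence of frame-level feature vectors into a single utterance-level vector by taking a softmax-weighted average of the frames. It is a minimal sketch under stated assumptions, not the authors' implementation: the scoring form e_t = v · tanh(W h_t + b), the function name `self_attentive_pooling`, and the parameters `W`, `b`, and `v` are all assumptions chosen for illustration, and in a real system these parameters would be learned jointly with the network.

```python
import math


def self_attentive_pooling(frames, W, b, v):
    """Pool T frame vectors (each of size D) into one D-dimensional vector.

    Assumed scoring form (illustrative, not taken from the memo):
        e_t     = v . tanh(W h_t + b)
        alpha_t = softmax over t of e_t
        pooled  = sum over t of alpha_t * h_t
    """
    scores = []
    for h in frames:
        # Hidden projection of one frame: tanh(W h + b)
        hidden = [
            math.tanh(sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i)
            for row, b_i in zip(W, b)
        ]
        # Scalar attention score for this frame
        scores.append(sum(v_i * u_i for v_i, u_i in zip(v, hidden)))

    # Numerically stable softmax over the time axis
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted average of the original frame vectors
    dim = len(frames[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    # Three 2-dimensional frames with hand-picked (hypothetical) parameters.
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    W = [[1.0, 0.0], [0.0, 1.0]]
    b = [0.0, 0.0]
    v = [1.0, -1.0]
    pooled, weights = self_attentive_pooling(frames, W, b, v)
    print("weights:", weights)
    print("pooled:", pooled)
```

Because the weights are computed per utterance and always sum to one, the output has a fixed size regardless of recording length. When `v` is all zeros, every frame receives the same weight and the layer reduces to plain mean pooling.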

Attention-Based Models for Text-Dependent Speaker Verification
This paper analyzes the use of attention mechanisms for sequence summarization in the authors' end-to-end text-dependent speaker recognition system and shows that attention-based models can improve the Equal Error Rate (EER) of the speaker verification system by a relative 14% compared to their non-attention LSTM baseline model.
Triplet Entropy Loss: Improving The Generalisation of Short Speech Language Identification Systems
All three methods were found to improve the generalisation of the models, though not significantly. The research shows that Triplet Entropy Loss has great potential and should be investigated further, not only in language identification tasks but in any classification task.
Using closely-related language to build an ASR for a very under-resourced language: Iban
Using out-of-language data as the source language yielded a lower WER when Iban data was very limited. The authors also conducted experiments on cross-lingual ASR using subspace Gaussian Mixture Models (SGMM), where the shared parameters were obtained in either a monolingual or a multilingual fashion.
CMU Wilderness Multilingual
  • 2019
Language Identification Using Deep Convolutional Recurrent Neural Networks
Extensive experiments show that the proposed LID system is applicable to a range of noisy scenarios and can easily be extended to previously unknown languages while maintaining its classification accuracy.
Deep Speaker Embeddings
  • 2017
Automatic Spoken Language Identification, Multilingual Speech Processing
  • 2004
Semi-Supervised G2P Bootstrapping and Its Application to ASR for a Very Under-Resourced Language: Iban
  • 2014
  Grenoble, France