• Corpus ID: 235694278

What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis

S. A. Chowdhury, Nadir Durrani, Ahmed M. Ali
End-to-end deep neural network architectures have pushed the state-of-the-art in speech technologies, as well as in other spheres of Artificial Intelligence, subsequently leading researchers to train more complex and deeper models. These improvements came at the cost of transparency. Deep neural networks are innately opaque and difficult to interpret, compared to the traditional handcrafted feature-based models. We no longer understand what features are learned within these deep models, where… 
PARP: Prune, Adjust and Re-Prune for Self-Supervised Speech Recognition
This work proposes Prune-Adjust-Re-Prune (PARP), which discovers and finetunes subnetworks for much better ASR performance while requiring only a single downstream finetuning run, and demonstrates the computational advantage and performance gain of PARP over baseline pruning methods.


Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems
This work analyzes the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers and trained with a connectionist temporal classification (CTC) loss, and evaluates representations from different layers of the model.
Deep Language: a comprehensive deep learning approach to end-to-end language recognition
It is shown that an end-to-end deep learning system can be used to recognize language from speech utterances with various lengths and a combination of three deep architectures: feed-forward network, convolutional network and recurrent network can achieve the best performance compared to other network designs.
EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding
This paper presents the Eesen framework which drastically simplifies the existing pipeline to build state-of-the-art ASR systems and achieves comparable word error rates (WERs), while at the same time speeding up decoding significantly.
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.
Frame-Level Speaker Embeddings for Text-Independent Speaker Recognition and Analysis of End-to-End Model
A Convolutional Neural Network (CNN) based speaker recognition model for extracting robust speaker embeddings is proposed and it is found that the networks are better at discriminating broad phonetic classes than individual phonemes.
End-to-End Language Identification Using High-Order Utterance Representation with Bilinear Pooling
A novel network is proposed which aims to model an effective representation for high (first and second)-order statistics of LID-senones, defined as being LID analogues of senones in speech recognition.
What all do audio transformer models hear? Probing Acoustic Representations for Language Delivery and its Structure
This work compares two recent audio transformer models, Mockingjay and wav2vec 2.0, on a comprehensive set of language delivery and structure features, including audio, fluency and pronunciation features, and probes their understanding of textual surface, syntax, and semantic features.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional
Deep Neural Network Embeddings for Text-Independent Speaker Verification
It is found that the embeddings outperform i-vectors for short speech segments and are competitive on long duration test conditions, which are the best results reported for speaker-discriminative neural networks when trained and tested on publicly available corpora.
Convolutional Neural Networks and Language Embeddings for End-to-End Dialect Recognition
This paper proposes an end-to-end DID system and a Siamese neural network to extract language embeddings, and investigates a dataset augmentation approach to achieve robust performance with limited data resources.