Analyzing Deep CNN-Based Utterance Embeddings for Acoustic Model Adaptation

@article{Rownicka2018AnalyzingDC,
  title={Analyzing Deep CNN-Based Utterance Embeddings for Acoustic Model Adaptation},
  author={Joanna Rownicka and Peter Bell and Steve Renals},
  journal={2018 IEEE Spoken Language Technology Workshop (SLT)},
  year={2018},
  pages={235-241}
}
We explore why deep convolutional neural networks (CNNs) with small two-dimensional kernels, primarily used for modeling spatial relations in images, are also effective in speech recognition. We analyze the representations learned by deep CNNs and compare them with deep neural network (DNN) representations and i-vectors in the context of acoustic model adaptation. To explore whether interpretable information can be decoded from the learned representations, we evaluate their ability to…
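
One common way to use such embeddings for acoustic model adaptation, following the i-vector approach the abstract mentions, is to append a fixed per-utterance vector to every input frame. Below is a minimal PyTorch sketch, assuming a toy CNN with small 3×3 kernels and mean-pooling over the utterance; the architecture, names, and shapes are illustrative, not the authors' actual model.

import torch
import torch.nn as nn

# Minimal sketch, not the authors' model: derive an utterance-level embedding
# by mean-pooling the activations of a small deep CNN over time and frequency,
# then append it to every input frame, i-vector style, so the acoustic model
# can normalize for utterance/speaker variability.

class ToyCNNEncoder(nn.Module):
    """Toy deep CNN with small 3x3 kernels over (time, frequency)."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, embed_dim, kernel_size=3, padding=1), nn.ReLU(),
        )

    def forward(self, feats):
        # feats: (batch, time, freq) filterbank features
        h = self.conv(feats.unsqueeze(1))   # (batch, embed_dim, time, freq)
        return h.mean(dim=(2, 3))           # one fixed-size vector per utterance

def append_embedding(feats, emb):
    """Tile the utterance embedding across frames and concatenate it with
    the per-frame features (the usual embedding-based adaptation setup)."""
    batch, time, _ = feats.shape
    tiled = emb.unsqueeze(1).expand(batch, time, emb.shape[-1])
    return torch.cat([feats, tiled], dim=-1)

feats = torch.randn(2, 200, 40)             # 2 utterances, 200 frames, 40 fbanks
emb = ToyCNNEncoder()(feats)                # (2, 64)
adapted_inputs = append_embedding(feats, emb)
print(adapted_inputs.shape)                 # torch.Size([2, 200, 104])

Here the 64-dimensional embedding is tiled across all frames and concatenated with the 40-dimensional filterbanks, so the acoustic model sees 104-dimensional inputs; an i-vector or a DNN-derived embedding could be substituted for the CNN embedding without changing the rest of the pipeline.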

Citations

Embeddings for DNN Speaker Adaptive Training
TLDR
The speaker recognition performance of a given representation is not correlated with its ASR performance; in fact, the ability to capture more speech attributes than just speaker identity was the most important characteristic of the embeddings for efficient DNN-SAT ASR (a minimal probing sketch follows this list).
Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition
TLDR
This work proposes contextual frame representations that capture phonetic information at the acoustic frame level and can be used for utterance-level language, speaker, and speech recognition.
Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis
TLDR
A key finding is that the quality achieved for certain speakers seems consistent, regardless of the TTS or voice conversion (VC) system, and that the method provides an automatic way to identify such speakers.
Improving Singing Voice Separation Using Attribute-Aware Deep Network
TLDR
It is shown that a separation network informed of vocal activity learns to differentiate between vocal and non-vocal regions, and thus reduces interference and artifacts better than a network agnostic to this side information.
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
TLDR
A meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions reported in the literature, is presented, characterizing adaptation algorithms as embedding-based, model parameter adaptation, or data augmentation approaches.
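
Claims like the one in the DNN-SAT entry above, that an embedding captures more speech attributes than speaker identity alone, are typically tested with probing classifiers: train a simple model on fixed embeddings and check whether a given attribute can be decoded. Below is a minimal scikit-learn sketch using random stand-in data; the setup is illustrative and not taken from any of the papers above.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical probing experiment with random stand-in data: in practice the
# embeddings would be CNN/DNN activations or i-vectors extracted per utterance.
rng = np.random.default_rng(0)
num_utts, embed_dim, num_speakers = 1000, 64, 20

embeddings = rng.normal(size=(num_utts, embed_dim))
speaker_ids = rng.integers(0, num_speakers, size=num_utts)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, speaker_ids, test_size=0.2, random_state=0)

# Accuracy well above chance (1 / num_speakers = 5% here) would indicate the
# attribute is linearly decodable; repeating the probe for other labels
# (phones, gender, channel) maps out what the representation encodes.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.3f}")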
