Analyzing Deep CNN-Based Utterance Embeddings for Acoustic Model Adaptation
@inproceedings{Rownicka2018AnalyzingDC,
  title     = {Analyzing Deep CNN-Based Utterance Embeddings for Acoustic Model Adaptation},
  author    = {Joanna Rownicka and Peter Bell and Steve Renals},
  booktitle = {2018 IEEE Spoken Language Technology Workshop (SLT)},
  year      = {2018},
  pages     = {235--241}
}
We explore why deep convolutional neural networks (CNNs) with small two-dimensional kernels, primarily used for modeling spatial relations in images, are also effective in speech recognition. We analyze the representations learned by deep CNNs and compare them with deep neural network (DNN) representations and i-vectors, in the context of acoustic model adaptation. To explore whether interpretable information can be decoded from the learned representations, we evaluate their ability to…
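The utterance-level embeddings analyzed above are typically derived from frame-level network activations. A minimal sketch of one common recipe, mean-pooling hidden-layer activations over time, is shown below; the layer dimensionality and utterance length are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical frame-level activations from a hidden layer of an acoustic
# model: one 256-dim vector per frame of a 300-frame utterance.
frame_activations = np.random.randn(300, 256)

# A simple utterance embedding: average the frame-level representations
# over time, yielding one fixed-size vector regardless of utterance length.
utterance_embedding = frame_activations.mean(axis=0)

print(utterance_embedding.shape)  # (256,)
```

Because the pooling is over the time axis only, utterances of any length map to the same embedding dimensionality, which is what makes such vectors usable as auxiliary adaptation inputs.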
8 Citations
Embeddings for DNN Speaker Adaptive Training
- Computer Science
- 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2019
A representation's speaker recognition performance is not correlated with its ASR performance; in fact, the ability to capture more speech attributes than just speaker identity was the most important characteristic of the embeddings for efficient DNN-SAT ASR.
Pretraining for End-to-end Utterance-level Language and Speaker Recognition
- Computer Science
- 2019
This work proposes contextual frame representations that capture phonetic information at the acoustic frame level and can be used for utterance-level language, speaker, and speech recognition.
Contextual Phonetic Pretraining for End-to-end Utterance-level Language and Speaker Recognition
- Computer Science
- ArXiv
- 2019
This work proposes contextual frame representations that capture phonetic information at the acoustic frame level and can be used for utterance-level language, speaker, and speech recognition.
Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis
- Computer Science
- Odyssey
- 2020
A key finding is that the quality achieved for certain speakers seems consistent regardless of the TTS or VC system, and the method provides an automatic way to identify such speakers.
Improving Singing Voice Separation Using Attribute-Aware Deep Network
- Computer Science
- 2019 International Workshop on Multilayer Music Representation and Processing (MMRP)
- 2019
It is shown that the separation network informed of vocal activity learns to differentiate between vocal and nonvocal regions, and thus reduces interference and artifacts better compared to the network agnostic to this side information.
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
- Computer Science
- IEEE Open Journal of Signal Processing
- 2021
A meta-analysis of the performance of speech recognition adaptation algorithms, based on relative error rate reductions reported in the literature, is presented, and adaptation algorithms are characterized as based on embeddings, model parameter adaptation, or data augmentation.
Prediction of speech intelligibility with DNN-based performance measures
- Computer Science
- Comput. Speech Lang.
- 2022
References
Showing 1–10 of 25 references
Advances in Very Deep Convolutional Neural Networks for LVCSR
- Computer Science
- INTERSPEECH
- 2016
This paper proposes a new CNN design without time padding and without time pooling, which is slightly suboptimal for accuracy but has two significant advantages: it enables sequence training and deployment by allowing efficient convolutional evaluation of full utterances, and it allows batch normalization to be straightforwardly adopted for CNNs on sequence data.
Very deep multilingual convolutional neural networks for LVCSR
- Computer Science
- 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2016
A very deep convolutional network architecture with up to 14 weight layers and small 3×3 kernels, inspired by the VGG ImageNet 2014 architecture, is introduced, along with multilingual CNNs with multiple untied layers.
X-Vectors: Robust DNN Embeddings for Speaker Recognition
- Computer Science
- 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
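The additive-noise augmentation summarized above can be sketched in a few lines: scale a noise segment so the mixture reaches a target signal-to-noise ratio, then add it to the clean waveform. The sample rate, signals, and SNR value here are illustrative assumptions, not details from the cited paper.

```python
import numpy as np

# Hypothetical 1-second waveforms at 16 kHz: a clean speech signal and a
# noise segment of the same length (random stand-ins for real audio).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)

def add_noise(speech, noise, snr_db):
    """Scale `noise` so the speech/noise mixture has the requested SNR in dB."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

augmented = add_noise(speech, noise, snr_db=10.0)
print(augmented.shape)  # (16000,)
```

Running this over many (speech, noise, SNR) combinations multiplies the amount of training data without any new recordings, which is why it is described as an inexpensive robustness method.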
Recent progresses in deep learning based acoustic models
- Computer Science
- IEEE/CAA Journal of Automatica Sinica
- 2017
This paper describes models that are optimized end-to-end, emphasizing feature representations learned jointly with the rest of the system, the connectionist temporal classification (CTC) criterion, and the attention-based sequence-to-sequence translation model.
Exploring convolutional neural network structures and optimization techniques for speech recognition
- Computer Science
- INTERSPEECH
- 2013
This paper investigates several CNN architectures, including full and limited weight sharing, convolution along frequency and time axes, and stacking of several convolution layers, and develops a novel weighted softmax pooling layer so that the size in the pooled layer can be automatically learned.
Deep Convolutional Neural Networks with Layer-Wise Context Expansion and Attention
- Computer Science
- INTERSPEECH
- 2016
A deep convolutional neural network with layer-wise context expansion and location-based attention is proposed for large vocabulary speech recognition, and it is shown that this model outperforms both the DNN and the LSTM significantly.
Convolutional Neural Networks for Speech Recognition
- Computer Science
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2014
It is shown that further error rate reduction can be obtained by using convolutional neural networks (CNNs), and a limited-weight-sharing scheme is proposed that can better model speech features.
Simplifying very deep convolutional neural network architectures for robust speech recognition
- Computer Science
- 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
- 2017
A proposed model consisting solely of convolutional (conv) layers, and without any fully-connected layers, achieves a lower word error rate on Aurora 4 compared to other VDCNN architectures typically used in speech recognition.
Speaker adaptation of neural network acoustic models using i-vectors
- Computer Science
- 2013 IEEE Workshop on Automatic Speech Recognition and Understanding
- 2013
This work proposes to adapt deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR; the approach is comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
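The adaptation scheme summarized above amounts to appending the same fixed-length speaker vector to every frame of the acoustic input. A minimal sketch, with illustrative feature and i-vector dimensions (not the values used in the cited paper):

```python
import numpy as np

# Hypothetical inputs: 40-dim acoustic features for a 300-frame utterance,
# and a 100-dim i-vector characterizing the speaker.
acoustic_features = np.random.randn(300, 40)
i_vector = np.random.randn(100)

# Repeat the i-vector for every frame, then concatenate along the feature
# axis so each frame carries both acoustic and speaker information.
tiled = np.tile(i_vector, (acoustic_features.shape[0], 1))
adapted_input = np.concatenate([acoustic_features, tiled], axis=1)

print(adapted_input.shape)  # (300, 140)
```

Since the speaker information rides along as extra input dimensions, the network adapts its output in a single forward pass, which is why no second, speaker-adapted decoding pass is required.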
Time-frequency convolutional networks for robust speech recognition
- Computer Science
- 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)
- 2015
This work presents a modified CDNN architecture that is called the time-frequency convolutional network (TFCNN), in which two parallel layers of convolution are performed on the input feature space: convolution across time and frequency, each using a different pooling layer.