Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition

Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye, Zengrui Jin, Xunying Liu, Helen M. Meng
Automatic recognition of disordered speech remains a highly challenging task to date. Sources of variability commonly found in normal speech including accent, age or gender, when further compounded with the underlying causes of speech impairment and varying severity levels, create large diversity among speakers. To this end, speaker adaptation techniques play a vital role in current speech recognition systems. Motivated by the spectro-temporal level differences between disordered and normal… 


Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition
Novel spectro-temporal subspace basis deep embedding features derived using SVD speech spectrum decomposition are proposed in this paper to facilitate auxiliary feature based speaker adaptation of state-of-the-art hybrid DNN/TDNN and end-to-end Conformer speech recognition systems.
On-the-fly Feature Based Speaker Adaptation for Dysarthric and Elderly Speech Recognition
Experiments conducted on the UASpeech dysarthric and DementiaBank Pitt elderly speech datasets suggest that the proposed SBEVR feature based adaptation statistically outperforms both the baseline on-the-fly i-Vector adapted hybrid TDNN/DNN systems and offline batch mode model based LHUC adaptation using all speaker-level data.
Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training before being cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition
The proposed GAN based data augmentation approaches consistently outperform the baseline speed perturbation method by up to 0.91% and 3.0% absolute on the TORGO and DementiaBank data respectively.
Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio, visual and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training before being cross-domain and cross-lingual adapted to three datasets across two languages to produce UTI based articulatory features.
An Auditory Saliency Pooling-Based LSTM Model for Speech Intelligibility Classification
Results show that all the systems with saliency pooling significantly outperform a reference support vector machine (SVM)-based system and LSTM-based systems with mean pooling and attention pooling, suggesting that Kalinli’s saliency can be successfully incorporated into the LSTM architecture as an external cue for the estimation of the speech intelligibility level.


Investigation of Data Augmentation Techniques for Disordered Speech Recognition
A set of data augmentation techniques for disordered speech recognition, including vocal tract length perturbation (VTLP), tempo perturbation and speed perturbation, are investigated, and variations among impaired speakers in both the original and augmented data are exploited.
Regularized Speaker Adaptation of KL-HMM for Dysarthric Speech Recognition
A speaker adaptation method based on a combination of L2 regularization and confusion-reducing regularization, which can enhance discriminability between categorical distributions of the KL-HMM states while preserving speaker-specific information, is proposed.
Recognition of Dysarthric Speech Using Voice Parameters for Speaker Adaptation and Multi-Taper Spectral Estimation
This paper examines the applicability of voice parameters that are traditionally used for pathological voice classification such as jitter, shimmer, F0 and Noise Harmonic Ratio (NHR) contour in addition to Mel Frequency Cepstral Coefficients (MFCC) for dysarthric speech recognition.
Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition
Experiments conducted on the UASpeech corpus suggest that the proposed cross-domain visual feature generation based AVSR system consistently outperformed the baseline ASR system and the AVSR system using original visual features.
Phonetic Analysis of Dysarthric Speech Tempo and Applications to Robust Personalised Dysarthric Speech Recognition
An approach that non-linearly modifies speech tempo to reduce mismatch between typical and atypical speech is explored, resulting in a nearly 7% absolute improvement in comparison to a baseline speaker-dependent trained system evaluated using the UASpeech corpus.
A comparative study of adaptive, automatic recognition of disordered speech
This study investigates how far fundamental training and adaptation techniques developed in the LVCSR community can take us, and a variety of ASR systems using maximum likelihood and MAP adaptation strategies are established, with all speakers obtaining significant improvements compared to the baseline system regardless of the severity of their condition.
Data Augmentation Using Healthy Speech for Dysarthric Speech Recognition
Data augmentation using temporal and speed modifications to healthy speech to simulate dysarthric speech is explored using tempo based and speed based data augmentation respectively as compared to ASR performance using healthy speech alone for training.
Development of the CUHK Dysarthric Speech Recognition System for the UA Speech Corpus
This paper presents the development of the Chinese University of Hong Kong automatic speech recognition (ASR) system for the Universal Access Speech (UASpeech) corpus, including a range of deep neural network (DNN) acoustic models and their more advanced variants based on time delayed neural networks (TDNNs) and long short-term memory recurrent neural networks (LSTM-RNNs).
Model adaptation and adaptive training for the recognition of dysarthric speech
A statistical analysis performed across various systems and its specific implementation in modelling different dysarthric severity sub-groups showed that SAT-adapted systems were more applicable to handling disfluencies of more severe speech, and SI systems prepared from typical speech were more apt for modelling speech with a low level of severity.
Dysarthric speech recognition using dysarthria-severity-dependent and speaker-adaptive models
Evaluation of the proposed speaker adaptation scheme showed that the proposed approach provides substantial improvement over the conventional speaker-adaptive system when a small amount of adaptation data is available.
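
One of the summaries above mentions features derived using SVD speech spectrum decomposition. As a rough illustration of that general idea only, the following minimal sketch applies a plain singular value decomposition to a toy magnitude spectrogram and separates it into spectral and temporal bases. It is not the method of any paper listed here: the function name `svd_spectro_temporal_bases`, the `rank` parameter, and the synthetic input are all assumptions made for the example.

```python
import numpy as np


def svd_spectro_temporal_bases(spectrogram, rank):
    """Hypothetical helper: split a (freq_bins x frames) matrix into top-`rank` bases."""
    # Thin SVD: spectrogram = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(spectrogram, full_matrices=False)
    spectral_basis = U[:, :rank]    # columns span the frequency axis
    temporal_basis = Vt[:rank, :]   # rows span the time (frame) axis
    return spectral_basis, s[:rank], temporal_basis


# Synthetic stand-in for a magnitude spectrogram (assumption: 40 bins, 100 frames).
rng = np.random.default_rng(0)
freq_profile = rng.random(40)
time_envelope = rng.random(100)
spectrogram = np.outer(freq_profile, time_envelope) + 0.01 * rng.random((40, 100))

spectral_basis, singular_values, temporal_basis = svd_spectro_temporal_bases(
    spectrogram, rank=2
)

# Low-rank reconstruction from the retained bases.
low_rank = spectral_basis @ np.diag(singular_values) @ temporal_basis
relative_error = np.linalg.norm(spectrogram - low_rank) / np.linalg.norm(spectrogram)
```

In this sketch the left singular vectors play the role of a spectral basis and the right singular vectors that of a temporal basis. How such bases would be turned into speaker-level embedding features is left open here, since the summaries above do not specify it.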