Improving DNN-Based Automatic Recognition of Non-native Children Speech with Adult Speech

Yao Qian, Xinhao Wang, Keelan Evanini, David Suendermann-Oeft
Acoustic models for state-of-the-art DNN-based speech recognition systems are typically trained on at least several hundred hours of task-specific data. However, this amount of training data is not available for many applications. In this paper, we investigate how to use an adult speech corpus to improve DNN-based automatic speech recognition for non-native children's speech. Although there are many acoustic and linguistic mismatches between the speech of adults and children… 
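One common way to exploit an out-of-domain adult corpus when in-domain child data is scarce (a generic recipe, not necessarily the paper's exact method) is to pre-train an acoustic model on the plentiful adult data and then fine-tune it on the small child set. A minimal sketch, with a one-layer softmax classifier standing in for the DNN acoustic model and purely synthetic data and shapes:

```python
import numpy as np

def train_softmax(X, y, W=None, epochs=50, lr=0.1, n_classes=3):
    """Toy softmax 'acoustic model' trained by batch gradient descent."""
    if W is None:
        W = np.zeros((X.shape[1], n_classes))
    for _ in range(epochs):
        logits = X @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)          # softmax posteriors
        onehot = np.eye(n_classes)[y]
        W -= lr * X.T @ (p - onehot) / len(X)      # cross-entropy gradient
    return W

rng = np.random.default_rng(0)
adult_X, adult_y = rng.normal(size=(500, 20)), rng.integers(0, 3, 500)
child_X, child_y = rng.normal(size=(50, 20)), rng.integers(0, 3, 50)

W = train_softmax(adult_X, adult_y)                 # pre-train on adult data
W = train_softmax(child_X, child_y, W=W, lr=0.02)   # fine-tune on child data
```

The key design choice is that fine-tuning starts from the adult-trained weights with a smaller learning rate, so the scarce child data shifts the model rather than retraining it from scratch.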

Tables from this paper

Bidirectional LSTM-RNN for Improving Automated Assessment of Non-Native Children's Speech

Different neural network architectures for improving non-native children’s speech recognition and the impact of the features extracted from the corresponding ASR output on the automated assessment of speaking proficiency are investigated.

GANs for Children: A Generative Data Augmentation Strategy for Children Speech Recognition

The results show that a relative WER reduction of more than 20% can be obtained on a children's speech test set with the proposed method, and that the GAN-generated children's speech can even improve recognition of adults' speech within the authors' experimental setup.

Multi-Scale Context Adaptation for Improving Child Automatic Speech Recognition in Child-Adult Spoken Interactions

This paper considers the task of automatic recognition for children’s speech, in the context of child-adult spoken interactions during interviews of children suspected to have been maltreated, and demonstrates improvement in child speech recognition accuracy by conditioning on both the domain and the interlocutor's (adult) speech.

The SLT 2021 Children Speech Recognition Challenge: Open Datasets, Rules and Baselines

The Children Speech Recognition Challenge (CSRC) is launched as a flagship satellite event of the IEEE SLT 2021 workshop, and its datasets, rules, evaluation method, and baselines are introduced.

A Prompt-Aware Neural Network Approach to Content-Based Scoring of Non-Native Spontaneous Speech

A neural network approach to the automated assessment of non-native spontaneous speech in a listen-and-speak task that performs as well as a strong Support Vector Regressor baseline using content-related features, without any feature engineering.

Detection of Consonant Errors in Disordered Speech Based on Consonant-vowel Segment Embedding

Experimental results show that using CV segment embeddings improves the detection of speech errors involving the “difficult” consonants reported in previous studies.

Samrómur Children: An Icelandic Speech Corpus

The corpus was developed within the framework of the “Language Technology Programme for Icelandic 2019–2023”; the goal of the project is to make Icelandic available in language-technology applications.

Self-Adaptive DNN for Improving Spoken Language Proficiency Assessment

Experimental results show that self-adaptive DNNs trained with i-vectors reduce the absolute word error rate by 11.7% and deliver more accurate recognized word sequences for language proficiency assessment, increasing the correlation between automated and expert scoring.

Vocal tract length normalisation approaches to DNN-based children's and adults' speech recognition

Both VTLN-based approaches are shown to improve phone error rate, by up to 20% relative, compared to a baseline trained on a mixture of children's and adults' speech.
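VTLN compensates for vocal tract length differences (children's shorter tracts shift formants upward) by warping the frequency axis before feature extraction. The sketch below implements one common piecewise-linear formulation, not necessarily the exact variant used in the paper; the cut-off fraction 0.85 and the 8 kHz Nyquist frequency are illustrative assumptions:

```python
import numpy as np

def vtln_warp(freqs, alpha, f_nyq=8000.0, f_cut=0.85):
    """Piecewise-linear VTLN frequency warp (a common formulation).

    Frequencies below a cut-off are scaled by `alpha`; above it, a second
    linear segment maps the Nyquist frequency onto itself so the warped
    axis stays within the original band.
    """
    freqs = np.asarray(freqs, dtype=float)
    f0 = f_cut * f_nyq / max(alpha, 1.0)   # knee of the piecewise warp
    lo = alpha * freqs                      # linearly scaled region
    hi = alpha * f0 + (f_nyq - alpha * f0) * (freqs - f0) / (f_nyq - f0)
    return np.where(freqs <= f0, lo, hi)

# alpha = 1.0 leaves the axis unchanged; Nyquist is always a fixed point.
print(vtln_warp([1000.0, 8000.0], alpha=1.2))
```

In practice the warp is applied to the mel filterbank center frequencies, with `alpha` chosen per speaker (e.g. by maximum-likelihood grid search); `alpha > 1` is typical when recognizing children's speech with adult-trained models.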

Using deep neural networks to improve proficiency assessment for children English language learners

A DNN-based speech recognition system built using rectified linear units (ReLU) greatly outperformed the recognition accuracy of Gaussian mixture model (GMM)-HMMs, even when the latter were trained with eight times more data.

I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription

This paper shows how i-vector-based speaker adaptation can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system, and reports excellent results on a French-language audio transcription task.

Automatic speech recognition for children

A simple speaker normalization algorithm combining frequency warping and spectral shaping, introduced in [5], is shown to reduce acoustic variability and improve recognition performance for child speakers, while age-dependent acoustic modeling further reduces word error rate.

Child automatic speech recognition for US English: child interaction with living-room-electronic-devices

This study shows that, using a minimal amount of data, multiple components of a state-of-the-art adult-centric large vocabulary continuous speech recognition (LVCSR) system can be adapted to form a child-specific system, improving accuracy for children speaking US English to living-room electronic devices (LRED), e.g. a voice-operated TV or computer.

Large vocabulary automatic speech recognition for children

This paper describes the use of a neural network classifier to identify matched acoustic training data and the filtering of language modeling data to reduce the chance of producing offensive results, and compares long short-term memory (LSTM) recurrent networks with convolutional, LSTM, deep neural networks (CLDNNs).

Deep neural network acoustic models for spoken assessment applications

Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition

It is shown that pre-training initializes the weights at a point in the space where fine-tuning can be effective, and is thus crucial both for training deep structured models and for the recognition performance of a CD-DBN-HMM-based large-vocabulary speech recognizer.

Speaker adaptation of neural network acoustic models using i-vectors

This work proposes to adapt deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) to the network as input features, in parallel with the regular acoustic features; the adapted models are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
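The adaptation scheme described above, appending a fixed per-speaker i-vector to every acoustic frame before it enters the network, can be sketched as follows; the 40-dimensional filterbank frames and 100-dimensional i-vector are illustrative assumptions, not details from the paper:

```python
import numpy as np

def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Concatenate a fixed speaker i-vector onto every acoustic frame.

    frames:  (T, D) matrix of per-frame features (e.g. 40-dim filterbanks).
    ivector: (K,) speaker identity vector, constant across the utterance.
    Returns a (T, D + K) matrix suitable as DNN input.
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))  # repeat i-vector per frame
    return np.hstack([frames, tiled])

# Illustrative shapes: 300 frames of 40-dim features, 100-dim i-vector.
feats = np.random.randn(300, 40)
ivec = np.random.randn(100)
augmented = append_ivector(feats, ivec)
print(augmented.shape)  # (300, 140)
```

Because the i-vector is constant over the utterance, the network can learn a speaker-dependent bias from it while the per-frame features carry the phonetic content, which is why no second, speaker-adapted decoding pass is needed.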