Factorized Deep Neural Network Adaptation for Automatic Scoring of L2 Speech in English Speaking Tests

  Dean Luo, Chunxiao Zhang, Linzhong Xia, Lixin Wang
Speaker adaptation has been shown to be effective for speech recognition and evaluation of L2 speech. However, factors other than the speaker, such as recording environments and foreign accents, also affect the speech signal. Factorizing speaker, environment and other acoustic factors is crucial when evaluating L2 speech, in order to effectively reduce the acoustic mismatch between training and test conditions. In this study, we investigate the effects of deep neural network factorized adaptation techniques on L2…

Tables from this paper

Correlational Neural Network Based Feature Adaptation in L2 Mispronunciation Detection
  • Wenwei Dong, Yanlu Xie
  • Computer Science
    2019 International Conference on Asian Language Processing (IALP)
  • 2019
The mispronunciation detection accuracy of the CorrNet-based method improved by 3.19% over the un-normalized Fbank feature and by 1.74% over the bottleneck feature on a corpus of Japanese learners speaking Chinese.
Automatic Pronunciation Evaluation in High-stakes English Speaking Tests Based on Deep Neural Network Models
Using posterior-based segmental features combined with supra-segmental prosodic features, the proposed pronunciation evaluation system can provide human-expert-level performance for high-stakes English speaking tests.
Unsupervised Pronunciation Fluency Scoring by infoGan
  • Wenwei Dong, Yanlu Xie, Binghuai Lin
  • Computer Science, Linguistics
    2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
  • 2019
This work proposes an unsupervised learning approach, where an infoGan model is constructed to infer latent speech codes, and then these codes are used to build a classifier that distinguishes native and foreign speech.
Towards Lightweight Applications: Asymmetric Enroll-Verify Structure for Speaker Verification
This paper proposes an innovative asymmetric structure, which uses the large-scale ECAPA-TDNN model for enrollment and the small-scale ECAPA-TDNNLite model for verification, reducing the EER to 2.31%.


Factorised Representations for Neural Network Adaptation to Diverse Acoustic Environments
Using i-vectors, it is demonstrated that speaker or environment information can be factorised through multi-condition training of neural networks, and that bottleneck features can be extracted from networks trained to classify either speakers or environments.
Analysis and utilization of MLLR speaker adaptation technique for learners' pronunciation evaluation
Two novel methods are proposed to address the adverse effects of MLLR adaptation, and experimental results show that the proposed methods better utilize MLLR adaptation while avoiding over-adaptation.
Speaker adaptation of neural network acoustic models using i-vectors
This work proposes adapting deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features; the approach is comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
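A minimal sketch of the input-level i-vector adaptation this summary describes. The dimensions (40-dim filterbanks, a 100-dim i-vector) and function name are illustrative assumptions, not taken from the cited paper:

```python
import numpy as np

def append_ivector(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """Tile the utterance-level speaker i-vector and append it to each frame.

    frames:  (num_frames, feat_dim), e.g. 40-dim log-mel filterbanks
    ivector: (ivec_dim,), e.g. a 100-dim speaker i-vector
    returns: (num_frames, feat_dim + ivec_dim) network input
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))  # same i-vector every frame
    return np.concatenate([frames, tiled], axis=1)

frames = np.random.randn(300, 40)   # 3 s of 10 ms frames, 40-dim features
ivector = np.random.randn(100)      # utterance-level speaker representation
net_input = append_ivector(frames, ivector)
print(net_input.shape)  # (300, 140)
```

Because the i-vector is constant across the utterance, the network can learn a speaker-dependent shift of its input space, and decoding still requires only a single pass.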
Regularized-MLLR speaker adaptation for computer-assisted language learning system
A novel speaker adaptation technique, regularized-MLLR, for Computer Assisted Language Learning (CALL) systems that avoids the over-adaptation problem in which erroneous pronunciations come to be judged as good pronunciations after conventional MLLR speaker adaptation.
A new DNN-based high quality pronunciation evaluation for computer-aided language learning (CALL)
The experimental results show that the GOP estimated from averaged frame-level posteriors of "senones" correlates with human scores the best, and that the new approach improves the correlations relatively by 22.0% and 15.6% at the word and sentence levels, respectively.
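A hedged sketch of Goodness of Pronunciation (GOP) scoring by averaging frame-level log posteriors of the aligned senones, as the summary above describes. The array shapes and toy posteriors are illustrative assumptions:

```python
import numpy as np

def gop_score(posteriors: np.ndarray, target_senones: np.ndarray) -> float:
    """Mean log posterior of each frame's force-aligned target senone.

    posteriors:     (num_frames, num_senones) DNN softmax outputs
    target_senones: (num_frames,) aligned senone index per frame
    """
    rows = np.arange(len(target_senones))
    frame_logp = np.log(posteriors[rows, target_senones] + 1e-10)
    return float(frame_logp.mean())

# Toy example: a well-pronounced segment concentrates posterior mass on the
# aligned senone (index 2); a mispronounced one spreads it elsewhere.
good = np.tile([0.05, 0.05, 0.85, 0.05], (5, 1))
bad = np.tile([0.40, 0.40, 0.10, 0.10], (5, 1))
targets = np.full(5, 2)
print(gop_score(good, targets) > gop_score(bad, targets))  # True
```

In a real system the posteriors would come from the DNN acoustic model and the target senones from a forced alignment against the reference transcript; higher (less negative) GOP indicates better pronunciation.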
Speaker Adaptive Training of Deep Neural Network Acoustic Models Using I-Vectors
This paper ports the idea of SAT to deep neural networks (DNNs), and proposes a framework to perform feature-space SAT for DNNs, using i-vectors as speaker representations and an adaptation neural network to derive speaker-normalized features.
Factored adaptation of speaker and environment using orthogonal subspace transforms
The proposed subspace-based acoustic factorization framework for transform-based adaptation in speech recognition provides a straightforward factor-analysis formulation while allowing the independence among the estimated factor transforms to be formulated explicitly.
Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription
This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.
Improving deep neural network acoustic models using generalized maxout networks
This paper introduces two new types of generalized maxout units, called p-norm and soft-maxout, and presents a method to control instability when training unbounded-output nonlinearities.
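A hedged sketch of the p-norm generalized maxout unit mentioned above: pre-activations are split into groups, and each output is the p-norm of one group (p = 2 is the common choice). The group size and shapes here are illustrative assumptions:

```python
import numpy as np

def pnorm(x: np.ndarray, group_size: int = 2, p: float = 2.0) -> np.ndarray:
    """p-norm nonlinearity: reduce (batch, dim) pre-activations to
    (batch, dim // group_size), one p-norm per group of inputs."""
    batch, dim = x.shape
    groups = x.reshape(batch, dim // group_size, group_size)
    return np.power(np.sum(np.abs(groups) ** p, axis=2), 1.0 / p)

x = np.array([[3.0, 4.0, -1.0, 0.0]])
y = pnorm(x)  # groups (3, 4) and (-1, 0) under the 2-norm
print(y)      # [[5. 1.]]
```

Like maxout, the unit is unbounded above, which is why the cited paper pairs it with a mechanism to keep activations stable during training.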
Foreign accent matters most when timing is wrong
The study suggests that pitch errors affect the performance score, but not as significantly as do timing errors.