iVector-based discriminative adaptation for automatic speech recognition

Martin Karafiát, Lukáš Burget, Pavel Matejka, Ondrej Glembek, Jan Honza Černocký
2011 IEEE Workshop on Automatic Speech Recognition & Understanding
We present a novel technique for discriminative feature-level adaptation of an automatic speech recognition system. To utilize iVectors for adaptation, Region Dependent Linear Transforms (RDLT) are discriminatively trained with the MPE criterion on a large amount of annotated data to extract the relevant information from iVectors and to compensate the speech features. The approach was tested on standard CTS data. We found it to be complementary to common adaptation techniques. On a well tuned RDLT system…
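The compensation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the region model, the per-region transforms, and all dimensions are made-up placeholders, and the discriminative (MPE) training of the transforms is omitted entirely — here they are random. The sketch only shows the data flow: each frame is concatenated with the speaker i-vector, and a region-posterior-weighted linear transform produces an additive feature offset.

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, IV = 13, 4, 8                      # feature dim, regions, i-vector dim (arbitrary)
frames = rng.standard_normal((100, D))   # stand-in acoustic feature frames
ivector = rng.standard_normal(IV)        # stand-in per-speaker i-vector

# Hypothetical region model: K Gaussian centroids define the "regions".
centroids = rng.standard_normal((K, D))
# One linear transform per region, mapping [frame; i-vector] -> feature offset.
# In the paper these would be MPE-trained; here they are random placeholders.
transforms = rng.standard_normal((K, D, D + IV)) * 0.01

def region_posteriors(x):
    """Soft region assignment: softmax over negative distances to centroids."""
    d2 = ((centroids - x) ** 2).sum(axis=1)
    w = np.exp(-d2 - np.max(-d2))
    return w / w.sum()

def compensate(x, iv):
    """Add a region-weighted, i-vector-dependent offset to one frame."""
    z = np.concatenate([x, iv])
    gamma = region_posteriors(x)
    offset = np.einsum('k,kdj,j->d', gamma, transforms, z)
    return x + offset

adapted = np.array([compensate(x, ivector) for x in frames])
```

Because the i-vector enters through the transform input, the offset becomes speaker-dependent while the frame-level region weighting stays purely acoustic.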


Region dependent linear transforms in multilingual speech recognition

This work experiments with three popular transforms: HLDA, MPE-HLDA and Region Dependent Linear Transforms (RDLT), which are trained jointly with the acoustic model to extract maximum of the discriminative information from the raw features and to represent it in a form suitable for the following GMM-HMM based acoustic model.

I-vector dependent feature space transformations for adaptive speech recognition

The proposed approach is more practical for real-application speech recognition tasks since it eliminates the time-consuming adaptive training process to estimate the transformation matrix in feature-space discriminative linear regression (fDLR).

Analysis of X-Vectors for Low-Resource Speech Recognition

A study of the usability of x-vectors for adaptation of automatic speech recognition (ASR) systems; over 1% absolute improvement was observed with x-vectors over traditional i-vectors, even when the x-vector extractor was not trained on target Pashto data.

A comparative study of fMPE and RDLT approaches to LVCSR

A specific RDLT approach is identified and recommended for deployment in LVCSR applications and recognition accuracy and run-time efficiency of different variants of the above two methods are evaluated.

Transfer learning for automatic speech recognition systems

The experiments show that for all target training sizes, the transferred models outperformed the models that are only trained on the target data, and the model that is transferred using 20 hours of target data achieved 7.8% higher recognition accuracy than the source model.

Speaker adaptation of neural network acoustic models using i-vectors

This work proposes to adapt deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) as input features to the network in parallel with the regular acoustic features for ASR, comparable in performance to DNNs trained on speaker-adapted features with the advantage that only one decoding pass is needed.
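The input arrangement this snippet describes — the same speaker i-vector appended to every acoustic frame before the network — can be sketched as below. The toy two-layer network, the dimensions, and the random weights are all illustrative assumptions; a real system would use a deep, trained acoustic model.

```python
import numpy as np

rng = np.random.default_rng(1)

D, IV, H, STATES = 40, 100, 256, 500   # fbank dim, i-vector dim, hidden, senones (arbitrary)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Placeholder weights; in practice these come from training.
W1 = rng.standard_normal((D + IV, H)) * 0.05
W2 = rng.standard_normal((H, STATES)) * 0.05

def forward(frames, ivector):
    """Append the same speaker i-vector to every acoustic frame, then score."""
    iv = np.tile(ivector, (frames.shape[0], 1))
    x = np.concatenate([frames, iv], axis=1)   # (T, D + IV)
    return softmax(relu(x @ W1) @ W2)          # senone posteriors per frame

frames = rng.standard_normal((50, D))
posteriors = forward(frames, rng.standard_normal(IV))
```

The appeal noted in the snippet follows directly from this shape: the i-vector is just extra input, so no second, speaker-adapted decoding pass is required.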

Tied-state based discriminative training of context-expanded region-dependent feature transforms for LVCSR

A new discriminative feature transform approach to large vocabulary continuous speech recognition (LVCSR) using Gaussian mixture density hidden Markov models (GMM-HMMs) for acoustic modeling achieves a relative word error rate reduction of 10% and 6% respectively on Switchboard-1 conversational telephone speech transcription task.

A Unified Speaker Adaptation Approach for ASR

This work proposes a unified speaker adaptation approach consisting of feature adaptation and model adaptation, and employs a speaker-aware persistent memory model which generalizes better to unseen test speakers by making use of speaker i-vectors to form a persistent memory.

Lipreading with LipsID

Results from experiments with the LipNet network are presented by re-implementing the system and comparing it with and without LipsID features, showing a promising path for future experiments and other systems.

Learning representations for speech recognition using artificial neural networks

This thesis proposed an ANN with a structured output layer which models both context–dependent and context–independent speech units, with the context-independent predictions used at runtime to aid the prediction of context-dependent states.
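A structured output layer of the kind this thesis describes — a context-independent head whose predictions feed the context-dependent head — can be sketched as follows. The sizes and the way the CI posteriors are concatenated into the CD head's input are illustrative assumptions, not the thesis's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(3)

H, CI, CD = 128, 40, 1000   # hidden size, monophone count, senone count (arbitrary)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Two heads share the hidden representation; the CD head additionally
# sees the CI posteriors, so CI predictions can aid the CD prediction.
W_ci = rng.standard_normal((H, CI)) * 0.05
W_cd = rng.standard_normal((H + CI, CD)) * 0.05

def structured_output(hidden):
    p_ci = softmax(hidden @ W_ci)                                   # context-independent
    p_cd = softmax(np.concatenate([hidden, p_ci], axis=1) @ W_cd)   # context-dependent
    return p_ci, p_cd

h = rng.standard_normal((10, H))
p_ci, p_cd = structured_output(h)
```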

Recent progress on the discriminative region-dependent transform for speech feature extraction

The integration of RDT with SAT yields 7% relative improvement in word error rate (WER), and theoretical comparisons are made between RDT and other discriminative feature extraction methods, including the improved version of the feature-space MPE (fMPE) that uses the “mean-offsets” as additional input features.
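The "mean-offsets" mentioned in this snippet can be illustrated with a minimal sketch: for each frame, the posterior-weighted offsets from a set of Gaussian means are flattened into one long auxiliary feature vector. The Gaussian model and dimensions here are made-up placeholders; fMPE would learn a projection from such high-dimensional features back to the acoustic feature space.

```python
import numpy as np

rng = np.random.default_rng(2)

D, G = 13, 16          # feature dim, number of Gaussians in the offset model (arbitrary)
means = rng.standard_normal((G, D))

def posteriors(x):
    """Soft assignment of a frame to the Gaussians (spherical, unit variance)."""
    d2 = ((means - x) ** 2).sum(axis=1)
    w = np.exp(-0.5 * d2 - np.max(-0.5 * d2))
    return w / w.sum()

def mean_offset_features(x):
    """Posterior-weighted offsets of the frame from each Gaussian mean,
    flattened into one auxiliary feature vector of G * D values."""
    gamma = posteriors(x)
    return (gamma[:, None] * (x - means)).ravel()

x = rng.standard_normal(D)
aux = mean_offset_features(x)   # shape (G * D,)
```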

A compact model for speaker-adaptive training

A novel approach to estimating the parameters of continuous density HMMs for speaker-independent (SI) continuous speech recognition that jointly annihilates the inter-speaker variation and estimates the HMM parameters of the SI acoustic models.

Discriminative linear transforms for feature normalization and speaker adaptation in HMM estimation

Linear transforms have been used extensively for training and adaptation of HMM-based ASR systems. Recently procedures have been developed for the estimation of linear transforms under the Maximum…

Minimum phoneme error based heteroscedastic linear discriminant analysis for speech recognition

  • Bing Zhang, S. Matsoukas
  • Computer Science
  • Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), 2005
This technique, referred to as minimum phoneme error heteroscedastic linear discriminant analysis (MPE-HLDA), is shown to be more robust than traditional LDA methods in high dimensional spaces, and easy to incorporate with existing training procedures, such as HLDA-SAT and discriminative training of hidden Markov models (HMMs).

Improvements in MLLR-Transform-based Speaker Recognition

This paper reports recent improvements to the use of MLLR transforms derived from a speech recognition system as speaker features in a speaker verification system, which has about 27% lower decision cost than a state-of-the-art cepstral GMM speaker system, and 53% lower decision cost when trained on 8 conversation sides per speaker.

Discriminative linear transforms for speaker adaptation

This paper discusses the use of an alternative discriminative objective function for linear transform estimation, which is an interpolation of the maximum mutual information (MMI) objective function and the ML criterion, which more directly reduces the word error rate of the adaptation data than MLLR.

Discriminative Training for Large-Vocabulary Speech Recognition Using Minimum Classification Error

This article reports significant gains in recognition performance and model compactness as a result of discriminative training based on MCE training applied to HMMs, in the context of three challenging large-vocabulary speech recognition tasks.

Unsupervised discriminative adaptation using discriminative mapping transforms

A new framework for estimating transforms that are discriminative in nature, but are less sensitive to errors in the adaptation hypothesis is described, which significantly outperforms both standard ML and discriminatively trained transforms.

Discriminative speaker adaptation with conditional maximum likelihood linear regression

We present a simplified derivation of the extended Baum-Welch procedure, which shows that it can be used for Maximum Mutual Information (MMI) of a large class of continuous emission density hidden…

Constrained discriminative mapping transforms for unsupervised speaker adaptation

Experimental results show that DMTs based on constrained linear transforms yield 3% to 6% relative gain over MLE transforms in unsupervised speaker adaptation.