Triphone State-Tying via Deep Canonical Correlation Analysis

Weiran Wang, Hao Tang, Karen Livescu

Context-dependent phone models are used in modern speech recognition systems to account for co-articulation effects. Due to the vast number of possible context-dependent phones, state-tying is typically used to reduce the number of target classes for acoustic modeling. We propose a novel approach for state-tying which is completely data-dependent and requires no domain knowledge. Our method first learns low-dimensional embeddings of context-dependent phones using deep canonical correlation…

A Comparative Evaluation of GMM-Free State Tying Methods for ASR

Four refinements of the state-tying algorithm for the HMM/DNN hybrid architecture were tested; methods that used a decision criterion designed directly for neural networks consistently and significantly outperformed those that employed the standard Gaussian-based algorithm.

Social Signal Detection by Probabilistic Sampling DNN Training

Probabilistic sampling is a mathematically well-founded combination of upsampling and downsampling, which was found to outperform both of these simple resampling approaches and efficiently reduced the DNN training times.
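
As a rough illustration of how upsampling and downsampling can be combined, the sketch below picks a class first and then an example from that class. The interpolation formula, the parameter name `lam`, and the helper names are assumptions made for this example; they are not taken from the summary above. With `lam = 0` classes are drawn at their natural frequencies, and with `lam = 1` all classes are drawn equally often.

```python
import random


def class_sampling_probs(counts, lam):
    """Interpolate between the empirical class prior and a uniform prior.

    Assumed form: P(c) = lam * (1 / K) + (1 - lam) * prior(c),
    where K is the number of classes.
    """
    total = sum(counts.values())
    k = len(counts)
    return {c: lam / k + (1 - lam) * n / total for c, n in counts.items()}


def sample_batch(examples_by_class, lam, size, rng):
    """Draw `size` (class, example) pairs: choose a class, then an example."""
    counts = {c: len(xs) for c, xs in examples_by_class.items()}
    probs = class_sampling_probs(counts, lam)
    classes = list(probs)
    weights = [probs[c] for c in classes]
    batch = []
    for _ in range(size):
        c = rng.choices(classes, weights=weights, k=1)[0]
        batch.append((c, rng.choice(examples_by_class[c])))
    return batch


# Hypothetical imbalanced data: 90 frames of class "a", 10 of class "b".
data = {"a": list(range(90)), "b": list(range(10))}
probs_half = class_sampling_probs({"a": 90, "b": 10}, lam=0.5)
batch = sample_batch(data, lam=0.5, size=8, rng=random.Random(0))
```

Intermediate values of `lam` draw the rare class more often than its prior (upsampling) and the frequent class less often (downsampling) without discarding any data.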



Kernel CCA for multi-view learning of acoustic features using articulatory measurements

In phonetic frame classification experiments on data drawn from the University of Wisconsin X-ray Microbeam Database, it is found that KCCA provides consistent improvements over linear CCA, as well as over single-view unsupervised dimensionality reduction.

Unsupervised learning of acoustic features via deep canonical correlation analysis

This work uses the recently proposed deep CCA, where the functional form of the feature mapping is a deep neural network, and applies the approach on a speaker-independent phonetic recognition task using data from the University of Wisconsin X-ray Microbeam Database.

Deep Multilingual Correlation for Improved Word Embeddings

Deep non-linear transformations of word embeddings of the two languages are learned, using the recently proposed deep canonical correlation analysis, to improve their quality and consistency on multiple word and bigram similarity tasks.

On rectified linear units for speech processing

This work shows that substituting rectified linear units for logistic units can improve generalization and make training of deep networks faster and simpler.

Low-rank matrix factorization for Deep Neural Network training with high-dimensional output targets

A low-rank matrix factorization of the final weight layer is proposed and applied to DNNs for both acoustic modeling and language modeling, showing a roughly equivalent reduction in training time without a significant loss in final recognition accuracy compared to a full-rank representation.
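
To make the parameter saving concrete, here is a minimal sketch that counts the weights in a full-rank final layer versus a rank-r factorization of it. The layer sizes and the rank are hypothetical values chosen for illustration, and biases are ignored.

```python
def full_rank_params(hidden: int, targets: int) -> int:
    """Weights in a single hidden-to-output matrix of shape hidden x targets."""
    return hidden * targets


def low_rank_params(hidden: int, targets: int, rank: int) -> int:
    """Weights when the matrix is replaced by (hidden x rank) @ (rank x targets)."""
    return hidden * rank + rank * targets


# Hypothetical sizes: 1024 hidden units, 6000 output targets, rank 128.
hidden, targets, rank = 1024, 6000, 128
full = full_rank_params(hidden, targets)      # 6,144,000 weights
low = low_rank_params(hidden, targets, rank)  # 899,072 weights
saving = 1 - low / full                       # fraction of final-layer weights removed
```

The factorization only pays off when the rank is well below both layer dimensions, which is why it targets the final layer, where the number of output targets is large.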

Multi-view CCA-based acoustic features for phonetic recognition across speakers and domains

  • R. Arora, Karen Livescu
  • 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
  • 2013
The behavior of CCA-based acoustic features on the task of phonetic recognition is studied, along with the extent to which they are speaker-independent or domain-independent.

Extracting deep neural network bottleneck features using low-rank matrix factorization

This paper examines different stacked bottleneck (SBN) extraction architectures, and incorporates low-rank matrix factorization in the final weight layer to demonstrate the effectiveness of the SBN configurations when compared to state-of-the-art hybrid DNN approaches.

An Autoencoder Approach to Learning Bilingual Word Representations

This work explores the use of autoencoder-based methods for cross-language learning of vectorial word representations that are coherent between two languages, while not relying on word-level alignments, and achieves state-of-the-art performance.

The Kaldi Speech Recognition Toolkit

This paper describes the design of Kaldi, a free, open-source toolkit for speech recognition research that provides a speech recognition system based on finite-state automata, together with detailed documentation and a comprehensive set of scripts for building complete recognition systems.

Deep Canonical Correlation Analysis

DCCA is introduced, a method to learn complex nonlinear transformations of two views of data such that the resulting representations are highly linearly correlated; the parameters of both transformations are jointly learned to maximize the (regularized) total correlation.
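
The quantity being maximized can be sketched with NumPy as the sum of canonical correlations between two sets of features. This is a minimal illustration of a regularized total correlation for fixed features, not the paper's training code; in DCCA the inputs `X` and `Y` would be the outputs of the two networks. The function name and the regularization value `reg` are assumptions.

```python
import numpy as np


def total_correlation(X, Y, reg=1e-4):
    """Sum of canonical correlations between views X (n x dx) and Y (n x dy)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance estimates of each view, and the cross-covariance.
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root via eigendecomposition (S is symmetric PD).
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    # The singular values of T are the canonical correlations.
    return np.linalg.svd(T, compute_uv=False).sum()


rng = np.random.default_rng(0)
Z = rng.standard_normal((5000, 2))
# Two views that are invertible linear maps of the same latent variables.
X_shared = Z @ np.array([[1.0, 0.5], [0.0, 1.0]])
Y_shared = Z @ np.array([[2.0, 0.0], [1.0, 1.0]])
corr_shared = total_correlation(X_shared, Y_shared)
# Two unrelated views.
corr_indep = total_correlation(rng.standard_normal((5000, 2)),
                               rng.standard_normal((5000, 2)))
```

For the two views built from the same latent variables the value approaches the view dimension (here 2), while for unrelated views it stays near zero.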