Publications
Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data
A factorized hierarchical variational autoencoder that learns disentangled and interpretable representations from sequential data without supervision, by formulating the problem explicitly within a factorized hierarchical graphical model that imposes sequence-dependent and sequence-independent priors on different sets of latent variables.
An Unsupervised Autoregressive Model for Speech Representation Learning
Speech representations learned by the proposed unsupervised autoregressive neural model significantly improve performance on both phone classification and speaker verification over surface features and other supervised and unsupervised approaches.
Hierarchical Generative Modeling for Controllable Speech Synthesis
Proposes a high-quality controllable TTS model that can control latent attributes of the generated speech which are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.
Active Learning by Learning
A learning algorithm is designed that connects active learning with the well-known multi-armed bandit problem and postulates that, given an appropriate choice for the multi-armed bandit learner, it is possible to estimate the performance of different strategies on the fly.
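The idea of treating query strategies as bandit arms can be sketched as follows. This is a minimal illustration, not the paper's algorithm (which uses an EXP4.P-style learner with importance-weighted reward estimates): here an epsilon-greedy bandit selects among hypothetical strategy functions, and `estimate_reward` stands in for the paper's on-the-fly performance estimator.

```python
import random

def bandit_active_learner(strategies, estimate_reward, rounds, eps=0.1):
    """Epsilon-greedy bandit over active-learning query strategies.

    strategies:      list of callables, each proposing the next query
    estimate_reward: callable scoring a query's usefulness (a proxy for
                     learner performance, estimated on the fly)
    """
    counts = [0] * len(strategies)
    values = [0.0] * len(strategies)
    for _ in range(rounds):
        if random.random() < eps:
            arm = random.randrange(len(strategies))      # explore
        else:
            arm = max(range(len(strategies)), key=lambda i: values[i])
        query = strategies[arm]()         # chosen strategy proposes a query
        reward = estimate_reward(query)   # estimated benefit of that query
        counts[arm] += 1
        # incremental mean update of the arm's estimated value
        values[arm] += (reward - values[arm]) / counts[arm]
    return values
```

Over many rounds the bandit concentrates queries on whichever strategy yields the highest estimated reward, which is the mechanism the paper exploits to sidestep choosing a single strategy up front.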
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase its capabilities.
Learning Latent Representations for Speech Generation and Transformation
The capability of the convolutional VAE model to modify the phonetic content or the speaker identity for speech segments using the derived operations, without the need for parallel supervisory data, is demonstrated.
Semi-supervised Training for Improving Data Efficiency in End-to-end Speech Synthesis
A semi-supervised training framework is proposed to improve the data efficiency of Tacotron and allow it to utilize textual and acoustic knowledge contained in large, publicly-available text and speech corpora.
Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization
Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers.
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech.
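The core of a vector quantization layer is a nearest-neighbor lookup into a learned codebook. A minimal forward-pass sketch (the codebook and inputs here are illustrative; at training time gradients typically flow through via a straight-through estimator, not shown):

```python
import numpy as np

def vq_forward(z, codebook):
    """Quantize each frame vector z[t] to its nearest codebook entry.

    z:        (T, D) array of continuous frame representations
    codebook: (K, D) array of learned code vectors
    Returns the quantized (T, D) sequence and the (T,) discrete indices,
    which serve as the learned discrete linguistic units.
    """
    # squared Euclidean distance from every frame to every code: (T, K)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)          # nearest code per frame
    return codebook[idx], idx
```

The discrete indices produced this way are what can then be analyzed as hierarchical linguistic units (e.g. phone-like or word-like codes, depending on where the layer sits in the model).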
A prioritized grid long short-term memory RNN for speech recognition
This paper extends stacked long short-term memory (LSTM) RNNs by using grid LSTM blocks that formulate computation along not only the temporal dimension, but also the depth dimension, in order to alleviate vanishing gradient problems.