Publications
An Unsupervised Autoregressive Model for Speech Representation Learning
TLDR
Speech representations learned by the proposed unsupervised autoregressive neural model significantly improve performance on both phone classification and speaker verification over the surface features and other supervised and unsupervised approaches.
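A minimal sketch of the autoregressive predictive coding objective behind this model: an RNN reads past frames and is trained to predict a frame several steps ahead. Feature dimension, layer sizes, and the prediction shift below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal APC-style sketch: predict the frame SHIFT steps ahead from past frames.
import torch
import torch.nn as nn

FEAT_DIM = 80   # assumed log-Mel feature dimension
HIDDEN = 512    # assumed RNN hidden size
SHIFT = 3       # predict the frame SHIFT steps in the future

class APC(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM, HIDDEN, num_layers=3, batch_first=True)
        self.proj = nn.Linear(HIDDEN, FEAT_DIM)

    def forward(self, frames):                 # frames: (batch, time, FEAT_DIM)
        hidden, _ = self.rnn(frames)           # hidden states double as representations
        return self.proj(hidden), hidden

def apc_loss(model, frames):
    pred, _ = model(frames[:, :-SHIFT])        # condition on the past only
    target = frames[:, SHIFT:]                 # the future frames to predict
    return nn.functional.l1_loss(pred, target)

x = torch.randn(4, 200, FEAT_DIM)              # a batch of dummy utterances
loss = apc_loss(APC(), x)
loss.backward()
```

After pre-training, the RNN hidden states (rather than the predictions) are what get reused as speech representations.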
AST: Audio Spectrogram Transformer
TLDR
The Audio Spectrogram Transformer (AST) is introduced, the first convolution-free, purely attention-based model for audio classification, which achieves new state-of-the-art results on various audio classification benchmarks.
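A rough sketch of the patch-based idea behind AST, assuming a 128-bin Mel spectrogram and non-overlapping 16x16 patches; the real model uses overlapping patches and ImageNet-pretrained ViT weights, which are omitted here.

```python
# Toy AST-style model: split the spectrogram into patches, encode with a
# Transformer, and classify from a prepended [CLS] token.
import torch
import torch.nn as nn

class TinyAST(nn.Module):
    def __init__(self, n_mels=128, n_frames=1024, patch=16, dim=192, n_classes=527):
        super().__init__()
        n_patches = (n_mels // patch) * (n_frames // patch)
        self.to_patches = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, spec):                      # spec: (batch, n_mels, n_frames)
        x = self.to_patches(spec.unsqueeze(1))    # (batch, dim, H', W')
        x = x.flatten(2).transpose(1, 2)          # (batch, n_patches, dim)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        x = self.encoder(x)
        return self.head(x[:, 0])                 # classify from the [CLS] token

logits = TinyAST()(torch.randn(2, 128, 1024))
```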
Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech
TLDR
The proposed Speech2Vec model, a novel deep neural network architecture for learning fixed-length vector representations of audio segments excised from a speech corpus, is based on an RNN Encoder-Decoder framework and borrows the methodology of skip-grams or continuous bag-of-words for training.
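An illustrative sketch of the skip-gram variant of this idea: encode one word's audio segment into a fixed vector, then decode a neighbouring word's segment from that vector. Feature dimension, embedding size, and the teacher-forcing details are assumptions, not the paper's exact setup.

```python
# Skip-gram-style encoder-decoder over audio segments.
import torch
import torch.nn as nn

FEAT, EMB = 13, 50   # assumed MFCC dimension and embedding size

class Speech2VecSkipgram(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(FEAT, EMB, batch_first=True)
        self.decoder = nn.GRU(FEAT, EMB, batch_first=True)
        self.out = nn.Linear(EMB, FEAT)

    def embed(self, segment):                       # segment: (batch, time, FEAT)
        _, h = self.encoder(segment)
        return h[-1]                                # fixed-length word embedding

    def forward(self, center, neighbour):
        h0 = self.embed(center).unsqueeze(0)        # condition decoder on the center word
        # teacher-forced reconstruction of the neighbouring segment
        dec_in = torch.cat([torch.zeros_like(neighbour[:, :1]), neighbour[:, :-1]], dim=1)
        dec_out, _ = self.decoder(dec_in, h0)
        return nn.functional.mse_loss(self.out(dec_out), neighbour)

model = Speech2VecSkipgram()
loss = model(torch.randn(8, 40, FEAT), torch.randn(8, 35, FEAT))
```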
Vector-Quantized Autoregressive Predictive Coding
TLDR
This work proposes Vector-Quantized Autoregressive Predictive Coding (VQ-APC), a novel model that produces quantized representations, allowing us to explicitly control the amount of information encoded in the representations, and finds that there exists a point where phonetic and speaker information are amplified to maximize a self-supervised objective.
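A minimal vector-quantization layer of the kind VQ-APC inserts between the RNN and the prediction head; the codebook size is assumed, and the paper's specific training details are simplified to a plain straight-through estimator.

```python
# Nearest-code quantization with a straight-through gradient.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=128, dim=512):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, h):                               # h: (batch, time, dim)
        flat = h.reshape(-1, h.size(-1))                # (batch*time, dim)
        dist = torch.cdist(flat, self.codebook.weight)  # distance to every code
        codes = dist.argmin(dim=-1)                     # nearest-code index
        q = self.codebook(codes).view_as(h)
        # straight-through estimator: gradients flow as if q were h
        return h + (q - h).detach(), codes.view(h.shape[:-1])

vq = VectorQuantizer()
quantized, codes = vq(torch.randn(4, 100, 512))
```

Shrinking or growing the codebook is the knob that explicitly limits how much information the quantized representation can carry.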
Generative Pre-Training for Speech with Autoregressive Predictive Coding
TLDR
This paper proposes to use autoregressive predictive coding (APC), a recently proposed self-supervised objective, as a generative pre-training approach for learning meaningful, non-specific, and transferable speech representations.
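A sketch of the transfer setup this describes: a pre-trained APC-style encoder is frozen and only a small probe is trained on its hidden states for a downstream task. The sizes, the checkpoint path, and the phone-classification probe are assumptions for illustration.

```python
# Frozen pre-trained encoder + lightweight downstream probe.
import torch
import torch.nn as nn

N_PHONES, FEAT_DIM, HIDDEN = 48, 80, 512      # assumed sizes

encoder = nn.GRU(FEAT_DIM, HIDDEN, num_layers=3, batch_first=True)
# encoder.load_state_dict(torch.load("apc_pretrained.pt"))   # hypothetical checkpoint
for p in encoder.parameters():
    p.requires_grad = False                   # transfer as a frozen feature extractor

probe = nn.Linear(HIDDEN, N_PHONES)           # downstream classifier
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

frames = torch.randn(4, 200, FEAT_DIM)                # dummy utterances
labels = torch.randint(0, N_PHONES, (4, 200))         # dummy frame-level phone labels

with torch.no_grad():
    reps, _ = encoder(frames)                          # (batch, time, HIDDEN)
loss = nn.functional.cross_entropy(probe(reps).transpose(1, 2), labels)
loss.backward()
opt.step()
```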
Audio Word2Vec: Unsupervised Learning of Audio Segment Representations Using Sequence-to-Sequence Autoencoder
TLDR
This paper proposes unsupervised learning of Audio Word2Vec from audio data without human annotation using a Sequence-to-Sequence Autoencoder (SA), which significantly outperformed conventional Dynamic Time Warping (DTW) based approaches while requiring far less computation.
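A tiny sketch of the retrieval use-case implied above: once segments are mapped to fixed-length vectors by a sequence-to-sequence autoencoder, matching two segments reduces to a cosine similarity instead of a DTW alignment. The encoder here is an untrained stand-in, not the paper's trained model.

```python
# Compare two audio segments via fixed-length embeddings rather than DTW.
import torch
import torch.nn as nn

encoder = nn.GRU(13, 128, batch_first=True)       # assumed MFCC dim / embedding size

def embed(segment):                                # segment: (time, 13)
    _, h = encoder(segment.unsqueeze(0))
    return h[-1, 0]                                # fixed-length vector

query, candidate = torch.randn(42, 13), torch.randn(57, 13)
score = torch.cosine_similarity(embed(query), embed(candidate), dim=0)
```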
Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies
TLDR
Non-Autoregressive Predictive Coding (NPC), a self-supervised method, is proposed to learn a speech representation in a non-autoregressive manner by relying only on local dependencies of speech.
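A simplified sketch of the local-dependency idea: each frame is predicted from a small window on either side, with the centre taps of the convolution kernel masked so the target frame itself is never seen. Kernel size, mask width, and hidden size are illustrative assumptions.

```python
# Masked local convolution: predict a frame from its neighbours only.
import torch
import torch.nn as nn

FEAT, MASK = 80, 3          # feature dim and masked centre width (assumed)

class MaskedLocalPredictor(nn.Module):
    def __init__(self, kernel=15, hidden=256):
        super().__init__()
        self.conv = nn.Conv1d(FEAT, hidden, kernel, padding=kernel // 2)
        self.out = nn.Conv1d(hidden, FEAT, 1)
        mask = torch.ones(1, 1, kernel)
        centre = kernel // 2
        mask[..., centre - MASK // 2 : centre + MASK // 2 + 1] = 0.0
        self.register_buffer("mask", mask)

    def forward(self, frames):                        # frames: (batch, time, FEAT)
        x = frames.transpose(1, 2)                    # (batch, FEAT, time)
        w = self.conv.weight * self.mask              # zero the centre taps
        h = nn.functional.conv1d(x, w, self.conv.bias, padding=self.conv.padding[0])
        return self.out(torch.relu(h)).transpose(1, 2)

model = MaskedLocalPredictor()
frames = torch.randn(4, 200, FEAT)
loss = nn.functional.l1_loss(model(frames), frames)   # reconstruct the hidden frame
loss.backward()
```

Because every prediction depends only on a fixed local window, all frames can be computed in parallel, unlike an autoregressive RNN.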
PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation
TLDR
PSLA is presented, a collection of model-agnostic training techniques that can noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation.
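One of the listed techniques, model aggregation, can be as simple as averaging the parameters of several training checkpoints into a single model; the sketch below shows that form of weight averaging. The checkpoint paths and model name are hypothetical placeholders, not files from the paper.

```python
# Weight averaging across saved checkpoints (one form of model aggregation).
import torch
import torch.nn as nn

def average_checkpoints(model: nn.Module, paths):
    states = [torch.load(p, map_location="cpu") for p in paths]
    avg = {k: sum(s[k].float() for s in states) / len(states) for k in states[0]}
    model.load_state_dict(avg)
    return model

# usage (paths and MyTagger are placeholders):
# model = average_checkpoints(MyTagger(), ["ckpt_epoch20.pt", "ckpt_epoch25.pt"])
```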
Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization
TLDR
Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers.
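A minimal gradient-reversal sketch of the adversarial factorization idea: an auxiliary classifier tries to predict the noise condition from the speaker representation, while reversed gradients push that representation to discard noise information. The encoder, classifier, and dimensions are illustrative stand-ins.

```python
# Gradient reversal for adversarially removing noise information.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad                      # flip the sign on the way back

speaker_encoder = nn.Linear(80, 64)       # stand-in speaker embedding network
noise_classifier = nn.Linear(64, 2)       # adversary: clean vs. noisy

feats = torch.randn(8, 80)
noise_label = torch.randint(0, 2, (8,))

emb = speaker_encoder(feats)
adv_logits = noise_classifier(GradReverse.apply(emb))
adv_loss = nn.functional.cross_entropy(adv_logits, noise_label)
adv_loss.backward()                       # encoder receives reversed gradients
```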
Semi-supervised Training for Improving Data Efficiency in End-to-end Speech Synthesis
TLDR
A semi-supervised training framework is proposed to improve the data efficiency of Tacotron and allow it to utilize textual and acoustic knowledge contained in large, publicly-available text and speech corpora.
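A skeleton of the two-stage recipe this describes, with placeholder models: first pre-train the acoustic decoder on untranscribed speech by predicting the next frame, then fine-tune with the text encoder on the small paired corpus. The module names and sizes are hypothetical, not the paper's code.

```python
# Stage 1 of a semi-supervised TTS recipe: decoder pre-training on speech only.
import torch
import torch.nn as nn

decoder = nn.GRU(80, 256, batch_first=True)       # stand-in acoustic decoder
frame_head = nn.Linear(256, 80)
opt = torch.optim.Adam(list(decoder.parameters()) + list(frame_head.parameters()))

speech = torch.randn(4, 100, 80)                  # untranscribed speech, no text needed
out, _ = decoder(speech[:, :-1])                  # predict each next frame
loss = nn.functional.l1_loss(frame_head(out), speech[:, 1:])
loss.backward()
opt.step()

# stage 2 (not shown): attach the text encoder and fine-tune on paired text-speech data
```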
...
...