CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task and that larger training and label sets help, but only up to a point.
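As a rough illustration of the approach above, here is a minimal PyTorch sketch of a CNN classifier over log-mel spectrogram patches with a multi-label output. The 96-frame by 64-bin patch shape follows the paper's input setup, but the tiny two-layer network is an illustrative assumption, not one of the image-classification analogs (AlexNet, VGG, Inception, ResNet) the paper actually evaluates.

    import torch
    import torch.nn as nn

    class AudioCNN(nn.Module):
        """Small CNN over log-mel spectrogram patches; multi-label output."""
        def __init__(self, n_classes=30871):   # label vocabulary size from the paper
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(64, n_classes)

        def forward(self, x):          # x: (batch, 1, time_frames, mel_bins)
            h = self.features(x).flatten(1)
            return self.classifier(h)  # logits; train with BCEWithLogitsLoss

    model = AudioCNN()
    logits = model(torch.randn(4, 1, 96, 64))   # 96-frame x 64-mel patches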
Looking to listen at the cocktail party
TLDR
A deep network-based model that incorporates both visual and auditory signals to isolate a single speech signal from a mixture of sounds such as other speakers and background noise, showing a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech.
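A minimal sketch of the audio-visual idea, assuming per-frame face embeddings are already extracted and simply concatenated with the mixture spectrogram frames before a BLSTM predicts a soft mask. The published model is considerably more elaborate (dilated convolutions, complex masks, multiple speakers), and all dimensions here are hypothetical.

    import torch
    import torch.nn as nn

    class AVSeparator(nn.Module):
        """Fuse per-frame visual embeddings with mixture spectrogram frames
        and predict a mask for the target speaker (illustrative dimensions)."""
        def __init__(self, n_freq=257, visual_dim=512, hidden=400):
            super().__init__()
            self.blstm = nn.LSTM(n_freq + visual_dim, hidden,
                                 batch_first=True, bidirectional=True)
            self.mask = nn.Linear(2 * hidden, n_freq)

        def forward(self, spec, face):          # spec: (B,T,F), face: (B,T,visual_dim)
            h, _ = self.blstm(torch.cat([spec, face], dim=-1))
            return torch.sigmoid(self.mask(h))  # (B,T,F) mask applied to the mixture

    sep = AVSeparator()
    mask = sep(torch.rand(2, 100, 257), torch.rand(2, 100, 512))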
Learning the speech front-end with raw waveform CLDNNs
TLDR
It is shown that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech.
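A hedged sketch of a learned raw-waveform front end in the spirit of the paper's time-convolution layer: convolve the waveform with a bank of filters, max-pool each analysis window, and log-compress, yielding filterbank-like features. The filter count, filter length, and 16 kHz framing below are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RawWaveformFrontEnd(nn.Module):
        """Learned time-domain filterbank: convolve raw audio with many filters,
        pool over each analysis window, apply log compression."""
        def __init__(self, n_filters=40, filter_len=400, frame_len=400, hop=160):
            super().__init__()
            self.conv = nn.Conv1d(1, n_filters, filter_len, padding=filter_len // 2)
            self.frame_len, self.hop = frame_len, hop

        def forward(self, wav):                          # wav: (B, samples) at 16 kHz
            x = torch.relu(self.conv(wav.unsqueeze(1)))  # (B, n_filters, samples)
            x = F.max_pool1d(x, self.frame_len, self.hop)  # frame-level pooling
            return torch.log(x + 1e-6)                   # (B, n_filters, frames)

    feats = RawWaveformFrontEnd()(torch.randn(2, 16000))  # ~1 s of audio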
Speech denoising using nonnegative matrix factorization with priors
TLDR
A technique for denoising speech using nonnegative matrix factorization (NMF) in combination with statistical speech and noise models is presented, showing improvements in speech quality across a range of interfering noise types.
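A minimal NumPy sketch of supervised NMF denoising: speech and noise dictionaries W_speech and W_noise are assumed to have been trained offline on clean speech and noise spectrograms; at test time only the activations are inferred, and a Wiener-like soft mask reconstructs the speech magnitude. The multiplicative updates below use the Euclidean objective, a simplification relative to the paper's statistical priors.

    import numpy as np

    def nmf_activations(V, W, n_iter=100, eps=1e-9):
        """Infer activations H for a fixed dictionary W via multiplicative
        updates (Euclidean objective); V is a magnitude spectrogram (freq x time)."""
        H = np.random.rand(W.shape[1], V.shape[1])
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
        return H

    def denoise(V_mix, W_speech, W_noise, eps=1e-9):
        """Soft-mask the mixture using speech/noise dictionaries trained offline."""
        W = np.hstack([W_speech, W_noise])
        H = nmf_activations(V_mix, W)
        k = W_speech.shape[1]
        speech_est = W_speech @ H[:k]
        noise_est = W_noise @ H[k:]
        mask = speech_est / (speech_est + noise_est + eps)  # Wiener-like gain
        return mask * V_mix                                 # estimated clean magnitude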
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
TLDR
A novel system that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker, achieved by training two separate neural networks.
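A minimal sketch of the spectrogram-masking half of such a system, assuming the other network, a pretrained speaker encoder, already supplies a fixed-dimensional d-vector for the target speaker; the d-vector is tiled across time and concatenated with each mixture frame. The single LSTM and the layer sizes are illustrative, not the published VoiceFilter architecture.

    import torch
    import torch.nn as nn

    class VoiceFilterSketch(nn.Module):
        """Mask network conditioned on a d-vector of the target speaker;
        the speaker encoder producing the d-vector is trained separately."""
        def __init__(self, n_freq=257, dvec_dim=256, hidden=400):
            super().__init__()
            self.lstm = nn.LSTM(n_freq + dvec_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_freq)

        def forward(self, spec, dvec):    # spec: (B,T,F), dvec: (B,dvec_dim)
            d = dvec.unsqueeze(1).expand(-1, spec.shape[1], -1)  # tile over time
            h, _ = self.lstm(torch.cat([spec, d], dim=-1))
            return torch.sigmoid(self.out(h)) * spec  # masked target spectrogram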
Speech acoustic modeling from raw multichannel waveforms
TLDR
A convolutional neural network - deep neural network (CNN-DNN) acoustic model that takes raw multichannel waveforms as input, learns a filterbank-like feature representation through supervised training, and outperforms a DNN that uses log-mel filterbank magnitude features under noisy and reverberant conditions.
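A hedged sketch of the key first layer: a single convolution that sees all microphone channels at once, so its learned filters can exploit inter-channel time differences (spatial cues) together with spectral content. The channel count and filter dimensions are assumptions.

    import torch
    import torch.nn as nn

    # First layer filtering all microphone channels jointly; the learned
    # filters can combine spatial and spectral cues. Dimensions illustrative.
    mc_layer = nn.Conv1d(in_channels=2, out_channels=40, kernel_size=400, stride=160)
    wav = torch.randn(8, 2, 16000)                        # eight 2-channel, 1 s clips
    feats = torch.log(torch.relu(mc_layer(wav)) + 1e-6)   # (8, 40, frames)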
Acoustic Modeling for Google Home
TLDR
The technical and system-building advances made to the Google Home multichannel speech recognition system, launched in November 2016, result in an 8-28% relative reduction in WER compared to the current production system.
Universal Sound Separation
TLDR
A dataset of mixtures containing arbitrary sounds is developed, and the best methods produce an improvement in scale-invariant signal-to-distortion ratio (SI-SDR) of over 13 dB for speech/non-speech separation and close to 10 dB for universal sound separation.
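SI-SDR is the metric behind these numbers; a minimal NumPy implementation for 1-D signals is below, with improvement measured as the SI-SDR of the separated output minus that of the unprocessed mixture.

    import numpy as np

    def si_sdr(est, ref, eps=1e-9):
        """Scale-invariant signal-to-distortion ratio in dB: project the
        estimate onto the reference so a global gain does not affect the score."""
        ref = ref - ref.mean()
        est = est - est.mean()
        target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
        noise = est - target
        return 10 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))

    # improvement = si_sdr(separated, source) - si_sdr(mixture, source)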
Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition
TLDR
This paper introduces a neural network architecture that performs multichannel filtering in the first layer of the network, and shows that this network learns to be robust to varying target speaker direction of arrival, performing as well as a model that is given oracle knowledge of the true target speaker direction.
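A minimal sketch of multichannel filtering in the first layer, implemented as per-channel FIR filter banks whose outputs are summed across channels, i.e. learned filter-and-sum beamforming toward several "look directions". The channel, filter, and framing parameters are illustrative assumptions.

    import torch
    import torch.nn as nn

    class FilterAndSumLayer(nn.Module):
        """First layer performing multichannel filtering: one bank of FIR
        filters per microphone, with the filtered channels summed."""
        def __init__(self, n_channels=2, n_looks=10, taps=400, hop=160):
            super().__init__()
            # groups=n_channels gives each microphone its own filter bank
            self.fir = nn.Conv1d(n_channels, n_channels * n_looks, taps,
                                 stride=hop, groups=n_channels)
            self.n_channels, self.n_looks = n_channels, n_looks

        def forward(self, wav):            # wav: (B, channels, samples)
            y = self.fir(wav)              # (B, channels*looks, frames)
            B, _, T = y.shape
            y = y.view(B, self.n_channels, self.n_looks, T)
            return y.sum(dim=1)            # sum over channels: (B, looks, frames)

    out = FilterAndSumLayer()(torch.randn(4, 2, 16000))   # (4, 10, 98)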
Neural Network Adaptive Beamforming for Robust Multichannel Speech Recognition
TLDR
A neural network adaptive beamforming (NAB) technique that uses LSTM layers to predict time-domain beamforming filter coefficients at each input frame, achieving a 12.7% relative improvement in WER over a single-channel model.
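A hedged sketch of the NAB idea: an LSTM consumes the stacked multichannel frames and emits per-frame, per-channel filter coefficients, which are applied to the frames (here via circular convolution in the frequency domain, a simplification) and summed across channels. All dimensions are assumptions.

    import torch
    import torch.nn as nn

    class NABSketch(nn.Module):
        """An LSTM predicts time-domain filter coefficients for each channel
        at every frame; the filtered channels are summed (filter-and-sum)."""
        def __init__(self, n_channels=2, frame=400, hidden=512):
            super().__init__()
            self.lstm = nn.LSTM(n_channels * frame, hidden, batch_first=True)
            self.coeffs = nn.Linear(hidden, n_channels * frame)

        def forward(self, frames):         # frames: (B, T, channels, frame_samples)
            B, T, C, N = frames.shape
            h, _ = self.lstm(frames.reshape(B, T, C * N))
            w = self.coeffs(h).reshape(B, T, C, N)   # per-frame FIR filters
            # apply each filter by circular convolution in the frequency
            # domain, then sum the filtered channels
            Y = torch.fft.rfft(frames, dim=-1) * torch.fft.rfft(w, dim=-1)
            return torch.fft.irfft(Y, n=N, dim=-1).sum(dim=2)   # (B, T, N)

    enhanced = NABSketch()(torch.randn(2, 50, 2, 400))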
...