• Publications
Exploring Practical Aspects of Neural Mask-Based Beamforming for Far-Field Speech Recognition
TLDR
This work examines acoustic beamformers that employ neural networks for mask prediction as a front-end for automatic speech recognition (ASR) in practical scenarios such as voice-enabled home devices, and investigates different approaches to realizing online, or adaptive, NN-based beamforming.
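The mask-based beamforming pipeline summarized above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: `mask_based_mvdr` is a hypothetical helper, and the MVDR formulation with an eigenvector-based steering estimate is one common choice among several the literature uses.

```python
import numpy as np

def mask_based_mvdr(stft, speech_mask, eps=1e-10):
    """Illustrative mask-based MVDR beamformer (hypothetical helper).

    stft:        (C, F, T) complex multichannel STFT.
    speech_mask: (F, T) values in [0, 1], e.g. predicted by a neural network.
    Returns a beamformed single-channel STFT of shape (F, T).
    """
    C, F, T = stft.shape
    noise_mask = 1.0 - speech_mask
    out = np.zeros((F, T), dtype=complex)
    for f in range(F):
        X = stft[:, f, :]                                        # (C, T)
        # Mask-weighted spatial covariance matrices for speech and noise.
        Phi_s = (speech_mask[f] * X) @ X.conj().T / (speech_mask[f].sum() + eps)
        Phi_n = (noise_mask[f] * X) @ X.conj().T / (noise_mask[f].sum() + eps)
        Phi_n += eps * np.eye(C)                                 # regularization
        # Steering vector: principal eigenvector of the speech covariance.
        _, v = np.linalg.eigh(Phi_s)
        d = v[:, -1]
        num = np.linalg.solve(Phi_n, d)
        w = num / (d.conj() @ num + eps)                         # MVDR weights
        out[f] = w.conj() @ X
    return out
```

An online or adaptive variant, as discussed in the paper, would update the covariance matrices recursively per frame instead of averaging over the whole utterance.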
Unsupervised Sound Separation Using Mixture Invariant Training
TLDR
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
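The core MixIT idea, remixing separated sources to reconstruct two reference mixtures, can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: `mixit_loss` is a hypothetical name, the loss shown is negative SI-SNR, and the exhaustive search over binary assignments is the brute-force form of the mixing-matrix optimization.

```python
import itertools
import numpy as np

def mixit_loss(est_sources, mix1, mix2):
    """MixIT-style loss sketch: assign each estimated source to one of
    two reference mixtures and score the best reconstruction.

    est_sources: (M, T) sources separated from the mixture of mixtures.
    """
    def neg_si_snr(est, ref, eps=1e-8):
        # Negative scale-invariant SNR of est against ref.
        alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
        noise = est - alpha * ref
        return -10 * np.log10((np.sum((alpha * ref) ** 2) + eps) /
                              (np.sum(noise ** 2) + eps))

    M = est_sources.shape[0]
    best = np.inf
    # Each source goes to mixture 1 or mixture 2 (2**M binary assignments).
    for bits in itertools.product([0, 1], repeat=M):
        r1 = est_sources[[i for i, b in enumerate(bits) if b == 0]].sum(axis=0)
        r2 = est_sources[[i for i, b in enumerate(bits) if b == 1]].sum(axis=0)
        best = min(best, neg_si_snr(r1, mix1) + neg_si_snr(r2, mix2))
    return best
```

Because only mixtures are ever needed as supervision, the model can be trained directly on unlabeled single-channel audio.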
Unsupervised Sound Separation Using Mixtures of Mixtures
TLDR
This paper proposes a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures and shows that MixIT can achieve competitive performance compared to supervised methods on speech separation.
Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis
TLDR
Experiments indicate that the problem of overlapping speech for diarization and ASR can be effectively mitigated with the presence of a well-trained separation module, and an end-to-end modular system for the LibriCSS meeting data is proposed.
Improving Sound Event Detection in Domestic Environments using Sound Separation
TLDR
This paper starts from a sound separation model trained on the Free Universal Sound Separation dataset and the DCASE 2020 task 4 sound event detection baseline, and explores different methods to combine separated sound sources and the original mixture within the sound event detection system.
What’s all the Fuss about Free Universal Sound Separation Data?
TLDR
An open-source baseline separation model that can separate a variable number of sources in a mixture is introduced, based on an improved time-domain convolutional network (TDCN++), that achieves scale-invariant signal-to-noise ratio improvement (SI-SNRi) on mixtures with two to four sources.
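The SI-SNRi metric referenced above can be computed as in the following minimal NumPy sketch; the function names are illustrative, not from the released baseline.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB.

    Projects the reference onto the estimate's optimal scaling so the
    metric is invariant to overall gain, then measures residual energy.
    """
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the separated output over the unprocessed mixture."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)
```

SI-SNRi is the standard figure of merit across the separation papers listed here: it reports how much closer the separated output is to the reference than the raw mixture was.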
Performance Study of a Convolutional Time-Domain Audio Separation Network for Real-Time Speech Denoising
TLDR
It is shown that a large part of the increase in performance between a causal and non-causal model is achieved with a lookahead of only 20 milliseconds, demonstrating the usefulness of even small lookaheads for many real-time applications.
Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation
TLDR
This paper introduces new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs to combat over-separation in mixture invariant training.
Image noise level estimation based on higher-order statistics
TLDR
A model-based technique for additive white Gaussian noise level estimation is proposed that matches the estimated and true values of higher-order moments of eligible transform coefficients of a single image.
Sequential Multi-Frame Neural Beamforming for Speech Separation and Enhancement
TLDR
This work introduces sequential neural beamforming, which alternates between neural network based spectral separation and beamforming based spatial separation, and introduces a multi-frame beamforming method which improves the results significantly by adding contextual frames to the beamforming formulations.
...