• Corpus ID: 207870571

Onssen: an open-source speech separation and enhancement library

Zhaoheng Ni, Michael I. Mandel
Speech separation is an essential task for multi-talker speech recognition. Recently, many deep learning approaches have been proposed and have repeatedly advanced the state of the art. The lack of available algorithm implementations restricts researchers to comparing results on the same dataset. A generic platform would let researchers easily implement novel separation algorithms and compare them with existing ones on customized datasets. We introduce "onssen": an open… 

ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration
The design of the toolkit, several important functionalities, especially the speech recognition integration, which differentiates ESPnet-SE from other open source toolkits, and experimental results with major benchmark datasets are described.
Asteroid: the PyTorch-based audio source separation toolkit for researchers
The software architecture of Asteroid is described, which provides all neural building blocks required to build a neural source separation system, and it is shown that the implementations are at least on par with most results reported in reference papers.
Automatic Speech-Based Checklist for Medical Simulations
The paper presents a fully automatic, autonomous speech-based checklist system capable of objectively identifying and validating anesthesia residents’ actions in a simulation environment, and argues that an audio-based system will improve the experience of a wide range of simulation platforms.
User Experience Sensor for Man–Machine Interaction Modeled as an Analogy to the Tower of Hanoi
The authors present the optimization mechanism of the HINT system as an analogy to the process of building a Tower of Hanoi, and the proposed sensor evaluates the user experience and measures the user/employee efficiency at every stage of a given process.


TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation.
Improving Mask Learning Based Speech Enhancement System with Restoration Layers and Residual Connection
A novel residual learning based speech enhancement model via adding different shortcut connections to a feature mapping network is proposed and it is shown such a structure can benefit from both the mask learning and the feature mapping.
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2019
A fully convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation, which significantly outperforms previous time–frequency masking methods in separating two- and three-speaker mixtures.
TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation
  • Yi Luo, N. Mesgarani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
Time-domain Audio Separation Network (TasNet) is proposed, which outperforms the current state-of-the-art causal and noncausal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required latency of the output.
Permutation invariant training of deep models for speaker-independent multi-talker speech separation
This work proposes a novel deep learning training criterion, named permutation invariant training (PIT), for speaker independent multi-talker speech separation, and finds that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL and generalizes well over unseen speakers and languages.
Deep clustering and conventional networks for music separation: Stronger together
It is shown that deep clustering outperforms conventional networks on a singing voice separation task, in both matched and mismatched conditions, even though conventional networks have the advantage of end-to-end training for best signal approximation.
End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction
This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its
FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks
This paper proposes several improvements to TCN for an end-to-end approach to monaural speech separation, consisting of a multi-scale dynamic weighted gated dilated convolutional pyramid network (FurcaPy), a gated TCN with intra-parallel convolutional components (FurcaPa), and a weight-shared multi-scale gated TCN (FurcaSh).
Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation
  • Yi Luo, Zhuo Chen, T. Yoshioka
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
Experiments show that by replacing the 1-D CNN with DPRNN and applying sample-level modeling in the time-domain audio separation network (TasNet), a new state-of-the-art performance on WSJ0-2mix is achieved with a model 20 times smaller than the previous best system.
Deep clustering: Discriminative embeddings for segmentation and separation
Preliminary experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of 6 dB, and the same model does surprisingly well with three-speaker mixtures.