Attention Is All You Need In Speech Separation

@inproceedings{Subakan2021AttentionIA,
  title={Attention Is All You Need In Speech Separation},
  author={Cem Subakan and Mirco Ravanelli and Samuele Cornell and Mirko Bronzi and Jianyuan Zhong},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={21-25}
}
  • Published 25 October 2020
  • Computer Science, Engineering
Recurrent Neural Networks (RNNs) have long been the dominant architecture in sequence-to-sequence learning. RNNs, however, are inherently sequential models that do not allow parallelization of their computations. Transformers are emerging as a natural alternative to standard RNNs, replacing recurrent computations with a multi-head attention mechanism. In this paper, we propose the SepFormer, a novel RNN-free Transformer-based neural network for speech separation. The SepFormer learns short and …
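The abstract's central point, that attention replaces recurrence and lets all time steps be processed in parallel, can be sketched as follows. This is a minimal NumPy illustration of scaled dot-product multi-head self-attention, not the SepFormer implementation; the learned Q/K/V projection matrices are replaced by identity slices for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, n_heads):
    """Toy self-attention over a (T, d) sequence: every time step attends
    to all others in one matrix product, so the sequence axis is processed
    in parallel rather than step by step as in an RNN."""
    T, d = x.shape
    assert d % n_heads == 0
    d_h = d // n_heads
    heads = []
    for h in range(n_heads):
        # Identity "projections" for illustration; real models learn W_q, W_k, W_v.
        q = k = v = x[:, h * d_h:(h + 1) * d_h]      # (T, d_h)
        scores = q @ k.T / np.sqrt(d_h)              # (T, T) pairwise similarities
        heads.append(softmax(scores, axis=-1) @ v)   # attention-weighted sum of values
    return np.concatenate(heads, axis=-1)            # (T, d)

x = np.random.randn(50, 16)
y = multi_head_attention(x, n_heads=4)
print(y.shape)  # (50, 16)
```

Because the (T, T) score matrix is computed in one shot, the cost of attention grows quadratically with sequence length; the dual-path scheme discussed below is precisely how SepFormer keeps that tractable for long waveforms.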

Citations

TransMask: A Compact and Fast Speech Separation Model Based on Transformer
TLDR: TransMask fully utilizes parallelism during inference, achieves nearly linear inference time within reasonable input audio lengths, and outperforms existing solutions on output speech quality, reaching an SDR above 16 on the LibriMix benchmark.
Monaural source separation: From anechoic to reverberant environments
TLDR: Taking the SepFormer as a starting point, the system is gradually modified to optimize its performance on reverberant mixtures, which leads to a word error rate improvement of 8 percentage points compared to the standard SepFormer implementation, but ends up with only marginally better performance than the improved PIT-BLSTM separation system.
North America Bixby Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2021
This paper describes the submission to the speaker diarization track of the VoxCeleb Speaker Recognition Challenge 2021 by the North America Bixby Lab of Samsung Research America.
SpeechBrain: A General-Purpose Speech Toolkit
TLDR: The core architecture of SpeechBrain is described; it is designed to support several tasks of common interest, allowing users to naturally conceive, compare, and share novel speech processing pipelines.
Stepwise-Refining Speech Separation Network via Fine-Grained Encoding in High-order Latent Domain
TLDR: The proposed Stepwise-Refining Speech Separation Network (SRSSN) learns a new latent domain along each basis function of the existing latent domain, obtaining a high-order latent domain in the refining phase that enables a more precise, stepwise-refined separation.
REAL-M: Towards Speech Separation on Real Mixtures
TLDR: The problem of evaluating performance on real-life mixtures, where the ground truth is not available, is addressed by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator; this estimator is shown to reliably evaluate separation performance on real mixtures.
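REAL-M's contribution is a *blind* neural estimator of SI-SNR, but the underlying metric itself is well defined whenever a reference signal is available. A minimal NumPy sketch of the standard SI-SNR computation (the signals and noise level below are illustrative, not from the paper):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB: project the (zero-mean) estimate onto the
    reference to get the target component, then compare its energy with the
    energy of the residual. Rescaling the estimate does not change the score."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

t = np.linspace(0, 1, 8000)
ref = np.sin(2 * np.pi * 440 * t)                    # toy reference signal
print(si_snr(0.5 * ref, ref))                        # very high: scaling is ignored
print(si_snr(ref + 0.3 * np.random.randn(t.size), ref))  # additive noise lowers the score
```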
BLOOM-Net: Blockwise Optimization for Masking Networks Toward Scalable and Efficient Speech Enhancement
  • Sunwoo Kim, Minje Kim
  • Computer Science, Engineering
  • ArXiv
  • 2021
TLDR: This paper designs a network with a residual learning scheme and trains the internal separator blocks sequentially to obtain a scalable masking-based deep neural network for speech enhancement, achieving the desired scalability with only a slight performance degradation.
BioCPPNet: Automatic Bioacoustic Source Separation with Deep Neural Networks
TLDR: This paper redefines the state-of-the-art in end-to-end single-channel bioacoustic source separation in a permutation-invariant regime across a heterogeneous set of non-human species.
Compute and memory efficient universal sound source separation
TLDR: This study provides a family of efficient neural network architectures for general-purpose audio source separation, focusing on the computational aspects that hinder the application of neural networks in real-world scenarios.
Configurable Privacy-Preserving Automatic Speech Recognition
TLDR: It is shown that voice privacy can be configurable, and it is argued that this presents new opportunities for privacy-preserving applications incorporating ASR.

References

Showing 1-10 of 34 references
End-To-End Multi-Speaker Speech Recognition With Transformer
TLDR: This work replaces the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture and incorporates an external dereverberation preprocessing step, the weighted prediction error (WPE), enabling the model to handle reverberated signals.
Neural Speech Synthesis with Transformer Network
TLDR: This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism, in Tacotron2, achieving state-of-the-art performance and close-to-human quality.
Light Gated Recurrent Units for Speech Recognition
TLDR: This paper revises one of the most popular RNN models, gated recurrent units (GRUs), proposing a simplified architecture that proves very effective for ASR and replacing the hyperbolic tangent with rectified linear unit activations.
T-GSA: Transformer with Gaussian-Weighted Self-Attention for Speech Enhancement
TLDR: A Transformer with Gaussian-weighted self-attention (T-GSA), whose attention weights are attenuated according to the distance between target and context symbols, significantly improves speech-enhancement performance compared to the standard Transformer and RNNs.
A Comparative Study on Transformer vs RNN in Speech Applications
TLDR: An emergent sequence-to-sequence model called the Transformer achieves state-of-the-art performance in neural machine translation and other natural language processing applications; surprisingly, the Transformer also outperforms the RNN in 13 of 15 ASR benchmarks.
Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation
  • Yi Luo, Zhuo Chen, T. Yoshioka
  • Computer Science, Engineering
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR: Experiments show that replacing the 1-D CNN with DPRNN and applying sample-level modeling in the time-domain audio separation network (TasNet) achieves a new state-of-the-art performance on WSJ0-2mix, with a model 20 times smaller than the previous best system.
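The dual-path idea shared by DPRNN and the SepFormer is to fold a long feature sequence into short overlapping chunks, so that both the intra-chunk and inter-chunk models operate over short axes. A minimal NumPy sketch of the chunking step (chunk length and hop are illustrative values, and the real models also apply an overlap-add inverse after processing):

```python
import numpy as np

def dual_path_chunks(seq, chunk_len, hop):
    """Fold a (T, d) feature sequence into (n_chunks, chunk_len, d) overlapping
    chunks, zero-padding the tail. An intra-chunk model then runs along axis 1
    and an inter-chunk model along axis 0, each seeing a short sequence."""
    T, d = seq.shape
    n_chunks = 1 + max(0, (T - chunk_len + hop - 1) // hop)
    pad = (n_chunks - 1) * hop + chunk_len - T
    seq = np.pad(seq, ((0, pad), (0, 0)))
    return np.stack([seq[i * hop:i * hop + chunk_len] for i in range(n_chunks)])

x = np.random.randn(1000, 64)                         # 1000 frames, 64 features
chunks = dual_path_chunks(x, chunk_len=250, hop=125)  # 50% overlap
print(chunks.shape)  # (7, 250, 64)
```

With chunk length around the square root of T, both axes stay short, which is what makes quadratic-cost attention affordable inside each path.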
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, having been applied successfully to English constituency parsing with both large and limited training data.
Multi-Task Self-Supervised Learning for Robust Speech Recognition
TLDR: PASE+ is proposed, an improved version of PASE that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks and learns transferable representations suitable for highly mismatched acoustic conditions.
Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks
In this paper, we propose the utterance-level permutation invariant training (uPIT) technique. uPIT is a practically applicable, end-to-end, deep-learning-based solution for speaker-independent multitalker speech separation.
FurcaNeXt: End-to-end monaural speech separation with dynamic gated dilated temporal convolutional networks
TLDR: This paper proposes several improvements of TCNs for end-to-end monaural speech separation, consisting of a multi-scale dynamic weighted gated dilated convolutional pyramid network (FurcaPy), a gated TCN with intra-parallel convolutional components (FurcaPa), and a weight-shared multi-scale gated TCN (FurcaSh).