Recurrent neural networks for polyphonic sound event detection in real life recordings

Giambattista Parascandolo, Heikki Huttunen, Tuomas Virtanen
2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
In this paper we present an approach to polyphonic sound event detection in real life recordings based on bi-directional long short-term memory (BLSTM) recurrent neural networks (RNNs). A single multilabel BLSTM RNN is trained to map acoustic features of a mixture signal, consisting of sounds from multiple classes, to binary activity indicators for each event class. Our method is tested on a large database of real-life recordings, with 61 classes (e.g. music, car, speech) from 10 different…
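The core idea of the abstract — a bidirectional LSTM with independent sigmoid outputs that maps each acoustic feature frame to per-class binary activity indicators — can be sketched as follows. This is a minimal, untrained NumPy toy: the weights are random, the LSTM cells are vanilla (no peepholes), and the layer sizes and 0.5 threshold are illustrative assumptions; the paper itself uses trained networks on real acoustic features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell with randomly initialized, untrained weights."""
    def __init__(self, n_in, n_hidden, rng):
        # Stacked gate weights: input, forget, cell-candidate, output gates.
        self.W = rng.standard_normal((4 * n_hidden, n_in + n_hidden)) * 0.1
        self.b = np.zeros(4 * n_hidden)
        self.n_hidden = n_hidden

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        H = self.n_hidden
        i, f = sigmoid(z[:H]), sigmoid(z[H:2 * H])
        g, o = np.tanh(z[2 * H:3 * H]), sigmoid(z[3 * H:])
        c_new = f * c + i * g
        h_new = o * np.tanh(c_new)
        return h_new, c_new

def run_lstm(cell, frames, reverse=False):
    """Run one direction over the frame sequence; out[t] is the hidden state at t."""
    T = len(frames)
    h, c = np.zeros(cell.n_hidden), np.zeros(cell.n_hidden)
    out = np.zeros((T, cell.n_hidden))
    order = range(T - 1, -1, -1) if reverse else range(T)
    for t in order:
        h, c = cell.step(frames[t], h, c)
        out[t] = h
    return out

def blstm_multilabel(frames, fwd, bwd, W_out, b_out, threshold=0.5):
    """Map feature frames to per-class binary activity indicators."""
    h = np.concatenate([run_lstm(fwd, frames),
                        run_lstm(bwd, frames, reverse=True)], axis=1)
    probs = sigmoid(h @ W_out.T + b_out)      # independent sigmoid per class
    return (probs >= threshold).astype(int)   # (T, n_classes) binary indicators

rng = np.random.default_rng(0)
n_feat, n_hidden, n_classes, T = 40, 32, 61, 10  # 61 classes as in the paper
frames = rng.standard_normal((T, n_feat))        # stand-in acoustic features
fwd = LSTMCell(n_feat, n_hidden, rng)
bwd = LSTMCell(n_feat, n_hidden, rng)
W_out = rng.standard_normal((n_classes, 2 * n_hidden)) * 0.1
b_out = np.zeros(n_classes)

activity = blstm_multilabel(frames, fwd, bwd, W_out, b_out)
print(activity.shape)  # (10, 61): one binary indicator per frame and class
```

Because the output layer uses one sigmoid per class rather than a softmax, several event classes can be active in the same frame — which is what makes the detection polyphonic.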


A polyphonic sound event detection (SED) system based on a multi-model architecture that uses one model based on Deep Neural Networks (DNN) to detect car sound events and five models based on Bi-directional Gated Recurrent Unit Recurrent Neural Networks (BGRU-RNNs) to detect other sound events.
Bidirectional LSTM-HMM Hybrid System for Polyphonic Sound Event Detection
The hybrid model of a neural network and an HMM, which achieved state-of-the-art performance in speech recognition, is extended to the multi-label classification problem and provides an explicit duration model for output labels, unlike the straightforward application of BLSTM-RNNs.
Duration-Controlled LSTM for Polyphonic Sound Event Detection
This paper builds upon a state-of-the-art SED method that performs frame-by-frame detection using a bidirectional LSTM recurrent neural network, and incorporates a duration-controlled modeling technique based on a hidden semi-Markov model that makes it possible to model the duration of each sound event precisely and to perform sequence-by-sequence detection without having to resort to thresholding.
Using Sequential Information in Polyphonic Sound Event Detection
This paper proposes to use delayed predictions of event activities as additional input features fed back to the neural network, to build N-grams that model the co-occurrence probabilities of different events, and to use a sequential loss to train the neural networks.
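The first idea in that summary — feeding delayed activity predictions back in as extra input features — can be sketched as below. The detector here is a hypothetical stand-in (a fixed random linear layer with sigmoids), not the paper's network, and the delay length and dimensions are illustrative assumptions.

```python
import numpy as np

def add_delayed_feedback(frames, predict_fn, delay, n_classes):
    """Run frame-by-frame detection, appending the prediction made `delay`
    frames earlier to each frame's input features (zeros until a delayed
    prediction is available)."""
    T, F = frames.shape
    augmented = np.zeros((T, F + n_classes))
    preds = np.zeros((T, n_classes))
    for t in range(T):
        feedback = preds[t - delay] if t >= delay else np.zeros(n_classes)
        x = np.concatenate([frames[t], feedback])
        preds[t] = predict_fn(x)   # event activity estimates for frame t
        augmented[t] = x
    return augmented, preds

# Hypothetical stand-in detector: a random linear layer + sigmoid.
rng = np.random.default_rng(0)
F, n_classes, delay = 8, 3, 5
Wd = rng.standard_normal((n_classes, F + n_classes)) * 0.1
predict = lambda x: 1.0 / (1.0 + np.exp(-(Wd @ x)))

frames = rng.standard_normal((20, F))
augmented, preds = add_delayed_feedback(frames, predict, delay, n_classes)
print(augmented.shape, preds.shape)  # (20, 11) (20, 3)
```

The delay keeps the loop causal: frame t only ever sees the prediction from frame t − delay, so the feedback can be computed online without waiting for the whole sequence.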
BLSTM-HMM hybrid system combined with sound activity detection network for polyphonic Sound Event Detection
This paper presents a new hybrid approach for polyphonic Sound Event Detection (SED) which incorporates a temporal structure modeling technique based on a hidden Markov model (HMM), combined with a sound activity detection network.
Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks
The proposed system, combining a 1D convolutional neural network and a recurrent neural network (RNN) with long short-term memory (LSTM) units, achieved 1st place in the challenge with an error rate of 0.13 and an F-score of 93.1.
Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
This work combines these two approaches in a convolutional recurrent neural network (CRNN) and applies it on a polyphonic sound event detection task and observes a considerable improvement for four different datasets consisting of everyday sound events.
A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification
This paper trains two variants of SoundNet, a deep convolutional network that takes the audio tracks of videos as input and tries to approximate the visual information extracted by an image recognition network, to introduce knowledge learned from a much larger corpus into the CTC network.
Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019)
The proposed SED system is compared against the state-of-the-art mono-channel method on the development subset of the TUT sound events detection 2016 database, and the usage of spatial and harmonic features is shown to improve the performance of SED.
A Deep Neural Network-Driven Feature Learning Method for Polyphonic Acoustic Event Detection from Real-Life Recordings
A Deep Neural Network (DNN)-driven feature learning method for polyphonic Acoustic Event Detection (AED) is proposed that outperforms the state-of-the-art methods.


Polyphonic sound event detection using multi label deep neural networks
Frame-wise spectral-domain features are used as inputs to train a deep neural network for multilabel classification in this work, and the proposed method improves the accuracy by 19 percentage points overall.
Polyphonic piano note transcription with recurrent neural networks
  • Sebastian Böck, M. Schedl
  • Computer Science
    2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2012
A new approach for polyphonic piano note onset transcription based on a recurrent neural network that simultaneously detects the onsets and pitches of notes from spectral features and generalizes much better than existing systems.
Universal Onset Detection with Bidirectional Long Short-Term Memory Neural Networks
This paper presents a new onset detector with superior performance and temporal precision for all kinds of music, including complex music mixes, based on auditory spectral features and relative spectral differences processed by a bidirectional Long Short-Term Memory recurrent neural network, which acts as a reduction function.
Acoustic event detection in real life recordings
A system for acoustic event detection in recordings from real-life environments using a network of hidden Markov models, capable of recognizing almost one third of the events, although the temporal positioning of the events is incorrect 84% of the time.
Speech recognition with deep recurrent neural networks
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations
A method that bypasses the supervised construction of class models is presented, which learns the components as a non-negative dictionary in a coupled matrix factorization problem, where the spectral representation and the class activity annotation of the audio signal share the activation matrix.
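The coupled factorization described above can be sketched as follows, under a Euclidean-loss assumption: the spectral representation V and the class-activity annotations A are factorized with a shared activation matrix X, which is equivalent to standard NMF with multiplicative updates on the stacked matrix [V; A]. The paper's exact divergence and normalization choices may differ; the data here is a synthetic toy.

```python
import numpy as np

def coupled_nmf(V, A, n_components, n_iter=200, eps=1e-9, seed=0):
    """Coupled NMF: V ≈ W @ X (spectra) and A ≈ B @ X (class activities)
    share the same activation matrix X. Multiplicative updates for the
    squared Euclidean loss, keeping all factors non-negative."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    C, _ = A.shape
    W = rng.random((F, n_components)) + eps
    B = rng.random((C, n_components)) + eps
    X = rng.random((n_components, T)) + eps
    for _ in range(n_iter):
        # X receives gradient information from BOTH reconstructions.
        X *= (W.T @ V + B.T @ A) / (W.T @ W @ X + B.T @ B @ X + eps)
        W *= (V @ X.T) / (W @ X @ X.T + eps)
        B *= (A @ X.T) / (B @ X @ X.T + eps)
    return W, B, X

# Synthetic toy: 20-bin "spectrogram" and 3 class-activity rows, both
# generated from the same rank-4 activations so the coupling is exact.
rng = np.random.default_rng(1)
X_true = rng.random((4, 50))
V = rng.random((20, 4)) @ X_true
A = rng.random((3, 4)) @ X_true
W, B, X = coupled_nmf(V, A, n_components=4)
rel_err = np.linalg.norm(V - W @ X) / np.linalg.norm(V)
print(rel_err)
```

Because X is shared, the learned spectral dictionary W is tied to the annotated class activities through B, so at test time new audio can be decomposed against W and the activations mapped to class activity estimates without building per-class models.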
Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks
A range of label-preserving audio transformations are applied, and pitch shifting is found to be the most helpful augmentation method for music data, reaching the state of the art on two public datasets.
Context-dependent sound event detection
The two-step approach was found to improve the results substantially compared to the context-independent baseline system, and the detection accuracy can be almost doubled by using the proposed context-dependent event detection.
Real-Time Detection of Overlapping Sound Events with Non-Negative Matrix Factorization
Two provably convergent algorithms are proposed and compared that address the problem of real-time detection of overlapping sound events by employing non-negative matrix factorization techniques and can improve detection in multi-source detection tasks of polyphonic music transcription, drum transcription and environmental sound recognition.