Weakly-Supervised Sound Event Detection with Self-Attention

@article{Miyazaki2020WeaklySupervisedSE,
  title={Weakly-Supervised Sound Event Detection with Self-Attention},
  author={Koichi Miyazaki and Tatsuya Komatsu and Tomoki Hayashi and Shinji Watanabe and Tomoki Toda and Kazuya Takeda},
  journal={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2020},
  pages={66-70}
}
In this paper, we propose a novel sound event detection (SED) method that incorporates the self-attention mechanism of the Transformer for a weakly-supervised learning scenario. The proposed method utilizes the Transformer encoder, which consists of multiple self-attention modules, allowing it to take both local and global context information of the input feature sequence into account. Furthermore, inspired by the great success of BERT in the natural language processing field, the proposed method…
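
The abstract's core idea can be sketched compactly: prepend a learnable, BERT-style tag token to the frame-level feature sequence, run a Transformer encoder, and read the clip-level (weak) prediction from the tag position while the remaining positions give frame-level (strong) predictions. The PyTorch sketch below illustrates that idea and is not the authors' implementation; all dimensions are illustrative and positional encoding is omitted.

```python
# Minimal sketch (not the paper's code) of a Transformer-encoder SED model
# with a BERT-style tag token for weakly-supervised training.
import torch
import torch.nn as nn

class TransformerSED(nn.Module):
    def __init__(self, n_mels=64, d_model=144, n_heads=4, n_layers=3, n_classes=10):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)               # frame features -> model dim
        self.tag = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable [CLS]-like tag token
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                    # x: (batch, frames, n_mels)
        h = self.proj(x)
        tag = self.tag.expand(h.size(0), -1, -1)
        h = self.encoder(torch.cat([tag, h], dim=1))         # prepend the tag token
        weak = torch.sigmoid(self.head(h[:, 0]))             # clip-level (weak) output
        strong = torch.sigmoid(self.head(h[:, 1:]))          # frame-level (strong) outputs
        return strong, weak
```

Under weak labels, only `weak` receives a supervised loss (e.g., binary cross-entropy against clip-level tags), while `strong` provides the event time boundaries at inference.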

Citations

SEMI-SUPERVISED SOUND EVENT DETECTION USING MULTISCALE CHANNEL ATTENTION AND MULTIPLE CONSISTENCY TRAINING
TLDR
A neural network-based sound event detection system that outputs sound events and their time boundaries in audio signals; it employs multi-scale CNNs with efficient channel attention, which can capture various features and pay more attention to the important areas of the features.
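
"Efficient channel attention" in the summary above plausibly refers to an ECA-style gate: a per-channel global average pool followed by a cheap 1-D convolution across channels. A hypothetical sketch (kernel size is illustrative):

```python
# Sketch of an ECA-style channel attention gate; one plausible reading of
# "efficient channel attention", not taken from the paper itself.
import torch
import torch.nn as nn

class ECALayer(nn.Module):
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: (batch, channels, time, freq)
        w = x.mean(dim=(2, 3))                     # global average pool per channel
        w = self.conv(w.unsqueeze(1)).squeeze(1)   # cheap 1-D conv across channels
        return x * torch.sigmoid(w)[:, :, None, None]  # reweight the feature maps
```
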
SP-SEDT: Self-supervised Pre-training for Sound Event Detection Transformer
TLDR
Inspired by the great success of UP-DETR in object detection, this work proposes self-supervised pre-training of SEDT (SP-SEDT) by detecting random patches (cropped only along the time axis); experiments show the proposed SP-SEDT can outperform the frame-based model.
CONVOLUTION-AUGMENTED TRANSFORMER FOR SEMI-SUPERVISED SOUND EVENT DETECTION Technical Report
TLDR
This model employs Conformer blocks, which combine self-attention and depthwise convolution, to efficiently capture both the global and local context information of an audio feature sequence.
CONFORMER-BASED SOUND EVENT DETECTION WITH SEMI-SUPERVISED LEARNING AND DATA AUGMENTATION
TLDR
The proposed method employs the Conformer, a convolution-augmented Transformer that exploits local features of audio data more effectively with CNNs while capturing global features with self-attention, and combines it with semi-supervised learning and data augmentation.
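
Both Conformer entries above hinge on the same building block: pairing self-attention with a depthwise-convolution module. Below is a sketch of the Conformer-style convolution module; dimensions and kernel size are illustrative and not taken from either report.

```python
# Sketch of the Conformer-style convolution module that complements
# self-attention with local, depthwise convolution over time.
import torch.nn as nn
import torch.nn.functional as F

class ConformerConvModule(nn.Module):
    def __init__(self, d_model=144, kernel_size=7):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, 1)    # pointwise; doubled for the GLU
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)  # depthwise over time
        self.bn = nn.BatchNorm1d(d_model)
        self.pw2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                       # x: (batch, frames, d_model)
        h = self.norm(x).transpose(1, 2)        # -> (batch, d_model, frames)
        h = F.glu(self.pw1(h), dim=1)           # gated linear unit halves the channels
        h = F.silu(self.bn(self.dw(h)))         # Swish after depthwise conv + batch norm
        h = self.pw2(h).transpose(1, 2)
        return x + h                            # residual connection
```
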
Sound Event Detection by Pseudo-Labeling in Weakly Labeled Dataset
TLDR
A more efficient model is constructed by employing a gated linear unit (GLU) and dilated convolution to address the de-emphasis of important features and the limited receptive field; pseudo-label-based learning for classifying target and unknown contents is also proposed, adding a 'noise label' and 'noise loss' so that unknown contents can be separated as much as possible through the noise label.
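
The GLU-plus-dilated-convolution part of the summary above can be illustrated as follows: a sigmoid gate learns per-unit importance while dilation widens the receptive field. A hypothetical block (channel count and dilation are assumptions):

```python
# Sketch of a dilated-convolution GLU block of the kind the summary describes.
import torch
import torch.nn as nn

class DilatedGLUBlock(nn.Module):
    def __init__(self, channels=64, dilation=2):
        super().__init__()
        # parallel convolutions: one produces values, the other a sigmoid gate
        self.value = nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)
        self.gate = nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time, freq)
        return self.value(x) * torch.sigmoid(self.gate(x))  # GLU gating
```
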
CHT+NSYSU SOUND EVENT DETECTION SYSTEM WITH MULTISCALE CHANNEL ATTENTION AND MULTIPLE CONSISTENCY TRAINING FOR DCASE 2021 TASK 4 Technical Report
TLDR
This technical report describes the submission system for DCASE 2021 Task 4 (sound event detection and separation in domestic environments), based on the mean-teacher framework of semi-supervised learning and CRNN and CNN-Transformer neural networks, which achieves a PSDS-scenario1 of 40.72% and a PSDS-scenario2 of 80.80% on the validation set.
Joint Weakly Supervised AT and AED Using Deep Feature Distillation and Adaptive Focal Loss
TLDR
This study proposes three methods to improve the best teacher-student framework of DCASE 2019 Task 4 for both AT and AED tasks: a frame-level, target-event-based deep feature distillation, an adaptive focal loss, and a two-stage training strategy to enable effective and more accurate model training.
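
For reference, a plain (non-adaptive) binary focal loss looks like the sketch below; the paper's adaptive variant and the feature distillation are not reproduced here, and alpha/gamma are illustrative defaults.

```python
# Sketch of a standard binary focal loss; gamma down-weights easy examples.
import torch

def binary_focal_loss(p, target, alpha=0.25, gamma=2.0):
    # p: predicted probabilities in (0, 1); target: 0/1 float tensor, same shape
    pt = target * p + (1 - target) * (1 - p)         # probability of the true class
    w = alpha * target + (1 - alpha) * (1 - target)  # class-balancing weight
    return (-w * (1 - pt) ** gamma * torch.log(pt.clamp(min=1e-8))).mean()
```
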
Dilated convolution and gated linear unit based sound event detection and tagging algorithm using weak label
TLDR
A Dilated Convolution Gated Linear Unit (DCGLU) is proposed to mitigate the lack-of-sparsity and small-receptive-field problems caused by the segmentation-map extraction process in sound event detection with weak labels, and is shown to exhibit robustness against nature-sound noises.
Self-Trained Audio Tagging and Sound Event Detection in Domestic Environments
TLDR
This paper uses a forward-backward convolutional recurrent neural network (FBCRNN) for tagging and pseudo-labeling, followed by tag-conditioned sound event detection (SED) models that are trained using strong pseudo-labels provided by the FBCRNN, and introduces a strong-label loss in the objective of the FBCRNN to take advantage of the strongly labeled synthetic data during training.

References

Showing 1-10 of 36 references
Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection
TLDR
This work introduces a convolutional neural network (CNN) with a large input field for AED that significantly outperforms state-of-the-art methods, including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.
A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling
TLDR
This paper builds a neural network called TALNet, the first system to reach state-of-the-art audio tagging performance on Audio Set while simultaneously exhibiting strong localization performance on the DCASE 2017 challenge.
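
The pooling functions such a comparison covers map frame-level probabilities to a clip-level probability under the multiple-instance-learning view of weak labeling. Three of the standard choices are sketched below (attention and exponential-softmax pooling omitted); `y` has shape (batch, frames, classes).

```python
# Sketch of common MIL pooling functions for weakly labeled SED.
import torch

def max_pooling(y):
    # clip probability = the single strongest frame
    return y.max(dim=1).values

def average_pooling(y):
    # every frame contributes equally
    return y.mean(dim=1)

def linear_softmax_pooling(y):
    # frames weighted by their own probability: sum(y^2) / sum(y)
    return (y * y).sum(dim=1) / y.sum(dim=1).clamp(min=1e-8)
```
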
SELF-ATTENTION MECHANISM BASED SYSTEM FOR DCASE 2018 CHALLENGE TASK 1 AND TASK 4
In this technical report, we provide a self-attention mechanism for Task 1 and Task 4 of the Detection and Classification of Acoustic Scenes and Events 2018 (DCASE 2018) challenge. We take convolutional…
Sound event detection using spatial features and convolutional recurrent neural network
TLDR
This paper proposes to use low-level spatial features extracted from multichannel audio for sound event detection and shows that, instead of concatenating the features of each channel into a single feature vector, the network learns sound events in multichannel audio better when they are presented as separate layers of a volume.
Duration-Controlled LSTM for Polyphonic Sound Event Detection
TLDR
This paper builds upon a state-of-the-art SED method that performs frame-by-frame detection using a bidirectional LSTM recurrent neural network, and incorporates a duration-controlled modeling technique based on a hidden semi-Markov model that makes it possible to model the duration of each sound event precisely and to perform sequence-by-sequence detection without having to resort to thresholding.
Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
TLDR
This work combines these two approaches in a convolutional recurrent neural network (CRNN) and applies it on a polyphonic sound event detection task and observes a considerable improvement for four different datasets consisting of everyday sound events.
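
A minimal CRNN in the spirit of this summary: convolutional layers extract local spectro-temporal features, a bidirectional GRU models the longer context, and a frame-wise sigmoid head emits event activities. All layer sizes below are illustrative, not taken from the paper.

```python
# Minimal CRNN sketch for polyphonic SED; pooling is applied only along
# frequency so the frame-level time resolution is preserved.
import torch
import torch.nn as nn

class CRNN(nn.Module):
    def __init__(self, n_mels=64, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 4)),
        )
        self.rnn = nn.GRU(64 * (n_mels // 16), 64, bidirectional=True, batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                      # x: (batch, 1, frames, n_mels)
        h = self.cnn(x)                        # (batch, 64, frames, n_mels // 16)
        h = h.permute(0, 2, 1, 3).flatten(2)   # (batch, frames, 64 * n_mels // 16)
        h, _ = self.rnn(h)                     # bidirectional GRU over frames
        return torch.sigmoid(self.head(h))     # frame-wise event probabilities
```
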
Close to Human Quality TTS with Transformer
TLDR
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures and the original attention mechanism in Tacotron2, achieving state-of-the-art performance and close-to-human quality.
Audio Event Detection using Weakly Labeled Data
TLDR
It is shown that audio event detection using weak labels can be formulated as a multiple instance learning (MIL) problem, and two frameworks for solving it are suggested: one based on support vector machines and the other on neural networks.
Sound Event Detection in Domestic Environments with Weakly Labeled Data and Soundscape Synthesis
TLDR
The paper introduces the Domestic Environment Sound Event Detection (DESED) dataset, which mixes part of last year's dataset with an additional synthetic, strongly labeled dataset provided this year and described in more detail.