General audio tagging with ensembling convolutional neural network and statistical features

@article{Xu2019GeneralAT,
  title={General audio tagging with ensembling convolutional neural network and statistical features},
  author={Kele Xu and Boqing Zhu and Qiuqiang Kong and Haibo Mi and Bo Ding and Dezhi Wang and Huaimin Wang},
  journal={The Journal of the Acoustical Society of America},
  year={2019},
  volume={145},
  number={6},
  pages={EL521}
}
Audio tagging aims to infer descriptive labels from audio clips, and it is challenging due to the limited size of the data and noisy labels. This paper describes a solution to the tagging task. The main contributions are the following: an ensemble learning framework is applied to combine statistical features with the outputs of deep classifiers, with the goal of exploiting complementary information. Moreover, a sample re-weighting strategy is employed to address the noisy-label problem… 
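The two contributions named in the abstract, late fusion of deep-classifier outputs with statistical features and sample re-weighting against noisy labels, can be sketched as follows. The function names, the fixed fusion weight, and the exponential re-weighting heuristic are illustrative assumptions for this sketch, not the paper's exact method:

```python
import math

def ensemble_predict(cnn_probs, stat_probs, w=0.5):
    """Late fusion: weighted average of deep-classifier class probabilities
    and probabilities from a classifier trained on statistical features.
    (Illustrative only; the paper's fusion scheme may differ.)"""
    return [w * p + (1.0 - w) * q for p, q in zip(cnn_probs, stat_probs)]

def sample_weights(losses, temperature=1.0):
    """One simple re-weighting heuristic for noisy labels: clips with a
    larger training loss (likely mislabelled) receive a smaller weight."""
    raw = [math.exp(-loss / temperature) for loss in losses]
    total = sum(raw)
    return [r / total for r in raw]
```

For example, fusing per-class probabilities `[0.8, 0.2]` from a CNN with `[0.6, 0.4]` from a statistical-feature classifier at `w=0.5` yields `[0.7, 0.3]`, and a clip with loss 2.0 is weighted well below one with loss 0.1.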

Citations

Reinforcement Learning based Neural Architecture Search for Audio Tagging
  • Haiyang Liu, C. Zhang
  • Computer Science
    2020 International Joint Conference on Neural Networks (IJCNN)
  • 2020
TLDR
This paper proposes the Convolutional Recurrent Neural Network with Attention and Location (ATT-LOC) as the audio tagging model and applies Neural Architecture Search to find the optimal number of filters and the filter size.
Multi-Representation Knowledge Distillation For Audio Classification
TLDR
A novel end-to-end collaborative learning framework that takes multiple representations as the input to train the models in parallel and can improve the classification performance and achieve state-of-the-art results on both acoustic scene classification tasks and general audio tagging tasks.
Weakly supervised CRNN system for sound event detection with large-scale unlabeled in-domain data
TLDR
A state-of-the-art general audio tagging model is first employed to predict weak labels for unlabeled data, and a weakly supervised architecture based on the convolutional recurrent neural network is developed to produce strong annotations of sound events with the aid of the unlabeled data and their predicted labels.
Audio Tagging Using CNN Based Audio Neural Networks for Massive Data Processing
TLDR
A large-scale audio dataset is used to train a pre-trained audio neural network that outperforms existing systems with a mean average precision of 0.45, and the performance of the proposed model is demonstrated by applying the audio neural network to five specific audio pattern recognition tasks.
Multimodal Deep Learning for Social Media Popularity Prediction With Attention Mechanism
TLDR
A novel multimodal deep learning framework for the popularity prediction task, which aims to leverage the complementary knowledge from different modalities, is proposed and results show that the proposed framework outperforms related approaches.
Music Artist Classification with WaveNet Classifier for Raw Waveform Audio Data
TLDR
An end-to-end architecture in the time domain for music artist classification is proposed and the bottleneck layer of the model is visualized to show the effectiveness of feature learning of the proposed method.
Spoken Language Identification using ConvNets
TLDR
A new attention based model for language identification which uses log-Mel spectrogram images as input is proposed and the effectiveness of raw waveforms as features to neural network models for LI tasks is presented.
Glottal Source Information for Pathological Voice Detection
TLDR
The evaluation of both approaches demonstrate that automatic detection of pathological voice from healthy speech benefits from using glottal source information.
Heartbeat Sound Signal Classification Using Deep Learning
TLDR
A proposed Recurrent Neural Network (RNN) model based on Long Short-Term Memory (LSTM), Dropout, Dense, and Softmax layers is applied and shown to be more competitive than other methods.
Ensembling Learning Based Melanoma Classification Using Gradient Boosting Decision Trees
TLDR
Both personal information (such as the age and gender of the patients) and the latest deep learning approaches are applied, demonstrating the advantages of the ensemble learning framework in the detection task.

References

SHOWING 1-10 OF 24 REFERENCES
Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging
TLDR
A shrinking deep neural network (DNN) framework incorporating unsupervised feature learning is proposed to handle the multilabel classification task, along with a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-filter bank features.
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
Sample Dropout for Audio Scene Classification Using Multi-Scale Dense Connected Convolutional Neural Network
TLDR
Inspired by the silence removal in the speech signal processing, a novel sample dropout approach is proposed, which aims to remove outliers in the training dataset, and can further improve the classification robustness of multi-scale DenseNet.
Mixup-Based Acoustic Scene Classification Using Multi-Channel Convolutional Neural Network
TLDR
This paper explores the use of Multi-channel CNN for the classification task, which aims to extract features from different channels in an end-to-end manner, and explores the using of mixup method, which can provide higher prediction accuracy and robustness in contrast with previous models.
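The mixup method referenced in the entry above can be sketched generically: two training examples and their labels are blended with a coefficient drawn from a Beta distribution. This is a sketch of the general mixup technique, not the cited paper's multi-channel CNN pipeline, and the parameter `alpha=0.2` is an illustrative choice:

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Mixup augmentation: return a convex combination of two examples
    (features x and one-hot labels y), with lam ~ Beta(alpha, alpha)."""
    lam = random.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

Because the labels are mixed with the same coefficient as the features, the resulting soft labels still sum to one, which encourages linear behaviour between training examples and tends to improve robustness.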
DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System
TLDR
This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset.
Squeeze-and-Excitation Networks
  • Jie Hu, Li Shen, Gang Sun
  • Computer Science
    2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
  • 2018
TLDR
This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
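The SE block's squeeze, excitation, and scale steps can be sketched in plain Python on toy per-channel data. The weight matrices `w1` and `w2` are hypothetical stand-ins for the learned fully connected layers; a real implementation operates on 2-D feature maps inside a deep learning framework:

```python
import math

def se_block(channel_maps, w1, w2):
    """Squeeze-and-Excitation sketch: global-average-pool each channel,
    pass the pooled vector through two tiny fully connected layers
    (ReLU then sigmoid), and rescale each channel by its score."""
    # Squeeze: one scalar per channel via global average pooling.
    z = [sum(fm) / len(fm) for fm in channel_maps]
    # Excitation: FC -> ReLU, then FC -> sigmoid (toy weights w1, w2).
    h = [max(0.0, sum(w * zj for w, zj in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * hj for w, hj in zip(row, h))))
         for row in w2]
    # Scale: reweight every value in each channel's map by its score.
    return [[sc * v for v in fm] for sc, fm in zip(s, channel_maps)]
```

With identity weights, a channel whose average activation is larger receives a sigmoid score closer to one, so the block amplifies informative channels relative to weak ones.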
General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline
TLDR
The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 diverse categories drawn from the AudioSet Ontology.
Densely Connected Convolutional Networks for Speech Recognition
TLDR
Experimental results show that DenseNet can be used for acoustic modeling (AM), significantly outperforming other neural-based models such as DNNs, CNNs, and VGGs.
Very Deep Convolutional Networks for Large-Scale Image Recognition
TLDR
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
CP-JKU SUBMISSIONS FOR DCASE-2016 : A HYBRID APPROACH USING BINAURAL I-VECTORS AND DEEP CONVOLUTIONAL NEURAL NETWORKS
TLDR
This report describes the CP-JKU team's 4 submissions for Task 1 (Acoustic scene classification) of the DCASE-2016 challenge and proposes a novel i-vector extraction scheme for ASC using both left and right audio channels, along with a deep convolutional neural network architecture trained end-to-end on spectrograms of audio excerpts.