General audio tagging with ensembling convolutional neural network and statistical features
@article{Xu2019GeneralAT,
  title={General audio tagging with ensembling convolutional neural network and statistical features},
  author={Kele Xu and Boqing Zhu and Qiuqiang Kong and Haibo Mi and Bo Ding and Dezhi Wang and Huaimin Wang},
  journal={The Journal of the Acoustical Society of America},
  year={2019},
  volume={145},
  number={6},
  pages={EL521}
}
Audio tagging aims to infer descriptive labels from audio clips, and it is challenging due to the limited size of datasets and noisy labels. This paper describes a solution to the tagging task. The main contributions are as follows: an ensemble learning framework is applied to combine statistical features with the outputs of deep classifiers, with the goal of exploiting their complementary information. Moreover, a sample re-weighting strategy is employed to address the noisy-label problem.
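The ensembling idea in the abstract — combining hand-crafted statistical features with the output probabilities of deep classifiers into one meta-feature set — can be sketched as a simple stacking scheme. The paper uses gradient-boosted trees as the meta-learner; the linear least-squares meta-learner below is a hypothetical stand-in chosen only to keep the sketch self-contained, and all data here is synthetic.

```python
import numpy as np

def stack_features(stat_feats, clf_probs):
    """Concatenate statistical features with each deep classifier's outputs."""
    return np.concatenate([stat_feats] + list(clf_probs), axis=1)

# Toy setup: 100 clips, 5 statistical features, two classifiers over 3 classes.
rng = np.random.default_rng(0)
stat = rng.standard_normal((100, 5))
p1 = rng.dirichlet(np.ones(3), size=100)      # softmax outputs of classifier 1
p2 = rng.dirichlet(np.ones(3), size=100)      # softmax outputs of classifier 2
y = np.eye(3)[rng.integers(0, 3, size=100)]   # one-hot tag labels

X = stack_features(stat, [p1, p2])            # (100, 11) meta-features
W, *_ = np.linalg.lstsq(X, y, rcond=None)     # linear meta-learner (stand-in for GBM)
pred = (X @ W).argmax(axis=1)                 # ensembled tag predictions
```

In the actual framework the meta-learner would be trained on held-out folds of the base classifiers' predictions to avoid leaking training labels into the ensemble.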
18 Citations
Reinforcement Learning based Neural Architecture Search for Audio Tagging
- Computer Science · 2020 International Joint Conference on Neural Networks (IJCNN)
- 2020
This paper proposes to use the Convolutional Recurrent Neural Network with Attention and Location (ATT-LOC) as the audio tagging model, and proposes to apply Neural Architecture Search to search for the optimal number of filters and the filter size.
Multi-Representation Knowledge Distillation For Audio Classification
- Computer Science · Multim. Tools Appl.
- 2022
A novel end-to-end collaborative learning framework that takes multiple representations as input to train models in parallel, improving classification performance and achieving state-of-the-art results on both acoustic scene classification and general audio tagging tasks.
Weakly supervised CRNN system for sound event detection with large-scale unlabeled in-domain data
- Computer Science · ArXiv
- 2018
A state-of-the-art general audio tagging model is first employed to predict weak labels for unlabeled data, and a weakly supervised architecture based on the convolutional recurrent neural network is developed to produce strong annotations of sound events with the aid of the unlabeled data with predicted labels.
Audio Tagging Using CNN Based Audio Neural Networks for Massive Data Processing
- Computer Science
- 2021
A large-scale audio dataset is used for training a pre-trained audio neural network that outperforms existing systems with a mean average precision of 0.45, and the performance of the proposed model is demonstrated by applying the audio neural network to five specific audio pattern recognition tasks.
Multimodal Deep Learning for Social Media Popularity Prediction With Attention Mechanism
- Computer Science · ACM Multimedia
- 2020
A novel multimodal deep learning framework for the popularity prediction task, which aims to leverage the complementary knowledge from different modalities, is proposed and results show that the proposed framework outperforms related approaches.
Music Artist Classification with WaveNet Classifier for Raw Waveform Audio Data
- Computer Science · ArXiv
- 2020
An end-to-end architecture in the time domain for music artist classification is proposed and the bottleneck layer of the model is visualized to show the effectiveness of feature learning of the proposed method.
Spoken Language Identification using ConvNets
- Computer Science, Linguistics · AmI
- 2019
A new attention-based model for language identification that uses log-Mel spectrogram images as input is proposed, and the effectiveness of raw waveforms as input features to neural network models for language identification (LI) tasks is demonstrated.
Glottal Source Information for Pathological Voice Detection
- Computer Science · IEEE Access
- 2020
The evaluation of both approaches demonstrate that automatic detection of pathological voice from healthy speech benefits from using glottal source information.
Heartbeat Sound Signal Classification Using Deep Learning
- Computer Science · Sensors
- 2019
A proposed Recurrent Neural Network (RNN) model based on Long Short-Term Memory (LSTM), Dropout, Dense and Softmax layers is applied and shown to be more competitive than other methods.
Ensembling Learning Based Melanoma Classification Using Gradient Boosting Decision Trees
- Computer Science · AIPR
- 2020
Both personal information (such as the patients' age and gender) and the latest deep learning approaches are applied, demonstrating substantial advantages of the ensemble learning framework in the melanoma detection task.
References
Showing 1–10 of 24 references
Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2017
A shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multilabel classification task, and a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) to generate new data-driven features from the logarithmic Mel-filter bank features.
CNN architectures for large-scale audio classification
- Computer Science · 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2017
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
Sample Dropout for Audio Scene Classification Using Multi-Scale Dense Connected Convolutional Neural Network
- Computer Science · PKAW
- 2018
Inspired by silence removal in speech signal processing, a novel sample dropout approach is proposed that removes outliers from the training dataset and can further improve the classification robustness of the multi-scale DenseNet.
Mixup-Based Acoustic Scene Classification Using Multi-Channel Convolutional Neural Network
- Computer Science · PCM
- 2018
This paper explores the use of a multi-channel CNN for the classification task, which aims to extract features from different channels in an end-to-end manner, and explores the use of the mixup method, which provides higher prediction accuracy and robustness compared with previous models.
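The mixup method referenced above forms convex combinations of pairs of training examples and their labels. The helper below is a minimal illustrative sketch of that idea on synthetic arrays, not a reproduction of the cited paper's multi-channel CNN pipeline; the function name and shapes are assumptions.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix two examples and their one-hot labels with lambda ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = float(rng.beta(alpha, alpha))    # mixing coefficient in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2        # convex combination of inputs
    y = lam * y1 + (1.0 - lam) * y2        # same combination of labels
    return x, y, lam

# Example: mix two toy log-Mel "spectrograms" with one-hot scene labels.
s1, s2 = np.ones((64, 128)), np.zeros((64, 128))
l1, l2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x, y, lam = mixup(s1, l1, s2, l2)
```

Because the label is mixed with the same coefficient as the input, the network is trained toward linear behavior between examples, which is what yields the robustness gains reported for mixup.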
DCASE2017 Challenge Setup: Tasks, Datasets and Baseline System
- Computer Science, Physics · DCASE
- 2017
This paper presents the setup of these tasks: task definition, dataset, experimental setup, and baseline system results on the development dataset.
Squeeze-and-Excitation Networks
- Computer Science · 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
This work proposes a novel architectural unit, termed the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and finds that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at minimal additional computational cost.
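The SE block's squeeze-excite-rescale pipeline can be sketched in a few lines of numpy. This is an illustrative forward pass only: the weight matrices `w1` and `w2` stand in for the two fully-connected layers of the excitation branch and are random here, not trained.

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-Excitation forward pass on an (H, W, C) feature map."""
    z = x.mean(axis=(0, 1))                  # squeeze: global average pool -> (C,)
    s = np.maximum(z @ w1, 0.0)              # excitation: FC + ReLU (bottleneck)
    s = 1.0 / (1.0 + np.exp(-(s @ w2)))      # FC + sigmoid -> per-channel gates in (0, 1)
    return x * s                             # rescale: reweight each channel

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
w1 = rng.standard_normal((16, 4)) * 0.1      # reduction ratio r = 4 (assumed)
w2 = rng.standard_normal((4, 16)) * 0.1
y = se_block(x, w1, w2)
```

Since the sigmoid gates lie strictly in (0, 1), the block can only attenuate channels, never amplify them, which is why it adds almost no computational cost while letting the network emphasize informative channels.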
General-purpose Tagging of Freesound Audio with AudioSet Labels: Task Description, Dataset, and Baseline
- Computer Science · DCASE
- 2018
The goal of the task is to build an audio tagging system that can recognize the category of an audio clip from a subset of 41 diverse categories drawn from the AudioSet Ontology.
Densely Connected Convolutional Networks for Speech Recognition
- Computer Science · ITG Symposium on Speech Communication
- 2018
Experimental results show that DenseNet can be used for acoustic modeling (AM), significantly outperforming other neural models such as DNNs, CNNs, and VGGs.
Very Deep Convolutional Networks for Large-Scale Image Recognition
- Computer Science · ICLR
- 2015
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
CP-JKU Submissions for DCASE-2016: A Hybrid Approach Using Binaural I-Vectors and Deep Convolutional Neural Networks
- Computer Science
- 2016
This report describes the CP-JKU team's four submissions for Task 1 (Audio scene classification) of the DCASE-2016 challenge, proposing a novel i-vector extraction scheme for ASC using both left and right audio channels, and a deep convolutional neural network architecture trained end-to-end on spectrograms of audio excerpts.