CNN architectures for large-scale audio classification

@article{hershey2017cnn,
  title={CNN architectures for large-scale audio classification},
  author={Shawn Hershey and Sourish Chaudhuri and Daniel P. W. Ellis and Jort F. Gemmeke and Aren Jansen and R. Channing Moore and Manoj Plakal and Devin Platt and Rif A. Saurous and Bryan Seybold and Malcolm Slaney and Ron J. Weiss and Kevin W. Wilson},
  journal={2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2017}
}
  • Published 29 September 2016
  • Computer Science
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image… 

Citations

Receptive Field Regularization Techniques for Audio Classification and Tagging With Deep Convolutional Neural Networks
The experiments show that regularizing the RF of CNNs using the proposed approaches can drastically improve the generalization of models, outperforming complex architectures and pre-trained models on larger datasets.
The Receptive Field as a Regularizer in Deep Convolutional Neural Networks for Acoustic Scene Classification
The receptive field (RF) of CNNs is analysed and the importance of the RF to the generalization capability of the models is demonstrated, showing that very small or very large RFs can cause performance degradation, but deep models can be made to generalize well by carefully choosing an appropriate RF size within a certain range.
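The RF size these papers tune can be computed in closed form from the kernel sizes and strides of the stacked layers. A minimal pure-Python sketch (function name hypothetical, not from any of the cited papers):

```python
# Receptive-field size of a stack of conv/pool layers, via the standard
# recurrence: rf grows by (kernel - 1) * jump, and jump multiplies by stride.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, input-to-output order."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Five 3x3 convs with stride 1 -> an 11x11 receptive field
print(receptive_field([(3, 1)] * 5))  # 11

# Interleaving stride-2 layers grows the RF much faster
print(receptive_field([(3, 2)] * 5))  # 63
```

This is the quantity the papers above vary: shrinking or widening the RF without changing depth, by adjusting kernels and strides.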
Comparison and Analysis of SampleCNN Architectures for Audio Classification
SampleCNN is scrutinized further by comparing it with spectrogram-based CNNs and by changing the subsampling operation in three different audio domains; the first-layer excitation is shown to be sensitive to loudness, an acoustic characteristic that distinguishes different genres of music.
Cross-modal supervised learning for better acoustic representations
This work proposes to exploit machine-generated labels to learn better acoustic representations, based on the synchronization between vision and audio, and trains various classical convolutional neural networks including VGGish, ResNet-50, and MobileNet v2.
Variational Information Bottleneck for Effective Low-resource Audio Classification
Evaluation on a few audio datasets shows that the VIB framework is ready-to-use and could be easily utilized with many other state-of-the-art network architectures, and outperforms baseline methods.
Sample-Level CNN Architectures for Music Auto-Tagging Using Raw Waveforms
This paper improves the 1-D CNN architecture for music auto-tagging by adopting building blocks from state-of-the-art image classification models (ResNets and SENets) and adding multi-level feature aggregation, comparing different combinations of these modules in building CNN architectures.
DCASE 2018 Challenge baseline with convolutional neural networks
The DCASE 2018 challenge has five tasks: 1) acoustic scene classification, 2) general-purpose audio tagging, 3) bird audio detection, 4) weakly-labeled semi-supervised sound event detection, and 5) multi-channel audio tagging; the Python baseline source code contains implementations of convolutional neural networks, including AlexNetish and VGGish, networks originating from computer vision.
Learning and Fusing Multimodal Deep Features for Acoustic Scene Categorization
A novel acoustic scene classification system based on multimodal deep feature fusion is proposed, where three CNNs have been presented to perform 1D raw waveform modeling, 2D time-frequency image modeling, and 3D spatial-temporal dynamics modeling, respectively.
Audio Recognition using Mel Spectrograms and Convolution Neural Networks
This study takes advantage of the robust machine learning techniques developed for image classification and applies them on the sound recognition problem, achieving a label-weighted label-ranking average precision (LWLARP) score and top-5 accuracy of 0.813 and 88.9%, respectively, when predicting 80 sound classes.
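The mel-spectrogram front end used throughout these papers can be sketched in a few lines of numpy; the parameter values below are illustrative (real systems typically use a library such as librosa, and the exact filterbank details differ):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filters mapping an FFT magnitude spectrum to n_mels bands."""
    # n_mels + 2 equally spaced points on the mel scale, mapped to FFT bins
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, mid):             # rising slope
            fb[i, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):             # falling slope
            fb[i, k] = (hi - k) / max(hi - mid, 1)
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=64):
    """Frame the signal, take magnitude FFTs, and project onto mel bands."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))          # (n_frames, n_fft//2+1)
    mel = mag @ mel_filterbank(n_mels, n_fft, sr).T    # (n_frames, n_mels)
    return np.log(mel + 1e-6)

# One second of a 440 Hz tone -> a (frames, mel bands) "image" for a CNN
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
print(log_mel_spectrogram(tone).shape)  # (97, 64)
```

The resulting 2-D log-mel patch is what lets image-classification CNNs be applied to audio almost unchanged.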
Over-Parameterization and Generalization in Audio Classification
This study investigates the relationship between over-parameterization of acoustic scene classification models, and their resulting generalization abilities and indicates that increasing width improves generalization to unseen devices, even without an increase in the number of parameters.

References

Large-Scale Video Classification with Convolutional Neural Networks
This work studies multiple approaches for extending the connectivity of a CNN in time domain to take advantage of local spatio-temporal information and suggests a multiresolution, foveated architecture as a promising way of speeding up the training.
Beyond short snippets: Deep networks for video classification
This work proposes and evaluates several deep neural network architectures to combine image information across a video over longer time periods than previously attempted, and proposes two methods capable of handling full length videos.
Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection
This work introduces a convolutional neural network (CNN) with a large input field for AED that significantly outperforms state-of-the-art methods, including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.
Very Deep Convolutional Networks for Large-Scale Image Recognition
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
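VGG's small-filter argument can be checked with simple parameter arithmetic: two stacked 3x3 layers cover the same 5x5 receptive field as one 5x5 layer, but with fewer weights (and an extra nonlinearity). The channel count below is illustrative:

```python
# Weight count of a square conv layer (bias terms ignored for simplicity).
def conv_params(kernel, channels_in, channels_out):
    return kernel * kernel * channels_in * channels_out

C = 64                                # illustrative channel width
one_5x5 = conv_params(5, C, C)        # 5*5*64*64 = 102400 weights
two_3x3 = 2 * conv_params(3, C, C)    # 2*3*3*64*64 = 73728 weights
print(one_5x5, two_3x3)               # 102400 73728
```

Stacking small filters thus buys the same spatial coverage for roughly 28% fewer parameters at this width, which is what allows pushing depth to 16-19 weight layers.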
Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks
This paper takes advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture, and finds that the CLDNN provides a 4-6% relative improvement in WER over an LSTM, the strongest of the three individual models.
ImageNet classification with deep convolutional neural networks
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
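The dropout regularizer mentioned here can be sketched in a few lines; this is the modern "inverted" variant (rescaling at train time), whereas the original AlexNet formulation scaled activations at test time instead:

```python
import numpy as np

def dropout(x, p, rng, train=True):
    """Inverted dropout: zero each unit with probability p during training and
    rescale survivors by 1/(1-p), so the expected activation matches test time."""
    if not train or p == 0.0:
        return x                      # identity at test time
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

rng = np.random.default_rng(42)
x = np.ones(100000)
y = dropout(x, 0.5, rng)
# y.mean() stays close to 1.0: the rescaling preserves the expectation
```

Randomly silencing units this way prevents co-adaptation of features, which is the effect the paper credits for reducing overfitting.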
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
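The residual idea is just y = relu(F(x) + x): the shortcut adds the input back onto the branch's output. A minimal numpy sketch, with dense layers standing in for the paper's convolutions to keep it short (all names hypothetical):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Identity-shortcut residual block: y = relu(F(x) + x),
    where F is a small two-layer branch."""
    fx = relu(x @ w1) @ w2     # the residual branch F(x)
    return relu(fx + x)        # shortcut adds the input back

rng = np.random.default_rng(0)
x = relu(rng.standard_normal((2, 8)))
# With a zero-initialised branch the block reduces to the identity,
# which is why very deep residual stacks remain easy to optimize.
y = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
print(np.allclose(y, x))  # True
```

Because the block only has to learn a *correction* to the identity, gradients flow through the shortcut even when the branch contributes little.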
Rethinking the Inception Architecture for Computer Vision
This work is exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization.
Recurrent neural networks for polyphonic sound event detection in real life recordings
In this paper we present an approach to polyphonic sound event detection in real life recordings based on bi-directional long short-term memory (BLSTM) recurrent neural networks (RNNs).
An exemplar-based NMF approach to audio event detection
A novel, exemplar-based method for audio event detection based on non-negative matrix factorisation, which models events as a linear combination of dictionary atoms, and mixtures as a linear combination of overlapping events.
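The NMF at the heart of this approach can be sketched with the classic Lee-Seung multiplicative updates; this is a generic NMF under Euclidean loss, not the paper's exact exemplar-based variant (there, W's columns would be fixed event exemplars and only the activations H updated):

```python
import numpy as np

def nmf(V, rank, iters=500, seed=0):
    """Multiplicative-update NMF: find nonnegative W, H with V ~= W @ H."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + 1e-3
    H = rng.random((rank, m)) + 1e-3
    for _ in range(iters):
        # Updates keep W and H nonnegative and do not increase ||V - WH||
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy "mixture": a rank-2 nonnegative matrix should be reconstructed closely
V = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 5.0]])
W, H = nmf(V, rank=2)
print(np.linalg.norm(V - W @ H))  # small residual
```

In the detection setting, each column of V is a spectral frame of the mixture and the learned activations H indicate which dictionary events are present over time.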