• Corpus ID: 17589207

Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms

Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, Juhan Nam
Recently, the end-to-end approach that learns hierarchical representations from raw data using deep convolutional neural networks has been successfully explored in the image, text, and speech domains. This approach has also been applied to musical signals, but has not yet been fully explored. To this end, we propose sample-level deep convolutional neural networks that learn representations from very small grains of waveforms (e.g. 2 or 3 samples), going beyond typical frame-level input representations… 
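The sample-level idea described in the abstract can be sketched numerically: stacking 1-D convolutions whose filter length and stride are as small as 2 or 3 samples grows the receptive field geometrically, so a handful of layers covers a frame-sized context. A minimal NumPy illustration of this arithmetic follows; the filter values, layer count, and input length are hypothetical, not the paper's exact configuration:

```python
import numpy as np

def conv1d_strided(x, w, stride):
    """Valid 1-D convolution over a single channel with the given stride."""
    k = len(w)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], w) for i in range(n_out)])

# Hypothetical sample-level stack: filter length 3, stride 3 at every layer,
# matching the "very small grains of waveforms (e.g. 2 or 3 samples)" idea.
filt = np.array([0.25, 0.5, 0.25])  # illustrative fixed filter, not learned weights
x = np.random.randn(3 ** 5)          # 243 raw audio samples

out = x
for layer in range(5):
    out = np.maximum(conv1d_strided(out, filt, stride=3), 0.0)  # conv + ReLU

print(len(out))  # 1: five layers of stride 3 reduce 3^5 samples to one value
print(3 ** 5)    # receptive field of that single output: 243 input samples
```

Each layer divides the temporal resolution by 3, so the receptive field of a unit after n layers is 3^n samples; this is how sample-level filters can still summarize frame-scale context.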


SampleCNN: End-to-End Deep Convolutional Neural Networks Using Very Small Filters for Music Classification

A CNN architecture that learns representations using sample-level filters, beyond typical frame-level input representations, is proposed, extended with a multi-level and multi-scale feature aggregation technique, and subsequently used for transfer learning on several music classification tasks.

A Multi-scale Convolutional Neural Network Architecture for Music Auto-Tagging

A convolutional neural network architecture that attempts to learn features over multiple timescales for music auto-tagging; it yields results close to the state of the art and comprehensively beats shallow classifiers trained on handcrafted features.

Sample-Level CNN Architectures for Music Auto-Tagging Using Raw Waveforms

This paper improves the 1-D CNN architecture for music auto-tagging by adopting building blocks from state-of-the-art image classification models (ResNets and SENets), adding multi-level feature aggregation, and comparing different combinations of these modules in building CNN architectures.
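The squeeze-and-excitation (SE) building block mentioned above can be adapted to 1-D audio features. The rough NumPy sketch below uses made-up dimensions and random weights purely to show the mechanism (global pooling, a bottleneck MLP, channel-wise gating); it is not the referenced architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block_1d(x, w1, w2):
    """Squeeze-and-excitation over a (channels, time) feature map:
    global-average-pool each channel, pass the descriptor through a
    small bottleneck MLP, and rescale the channels by the gates."""
    squeeze = x.mean(axis=1)                              # (C,) channel descriptors
    gates = sigmoid(w2 @ np.maximum(w1 @ squeeze, 0.0))   # (C,) values in (0, 1)
    return x * gates[:, None]                             # channel-wise recalibration

rng = np.random.default_rng(0)
C, T, r = 8, 32, 2                      # channels, time steps, reduction ratio
x = rng.standard_normal((C, T))
w1 = rng.standard_normal((C // r, C))   # squeeze -> bottleneck
w2 = rng.standard_normal((C, C // r))   # bottleneck -> gates
y = se_block_1d(x, w1, w2)
print(y.shape)  # (8, 32): same shape as the input, channels rescaled
```

Because each gate lies in (0, 1), the block can only attenuate channels, letting the network emphasize informative ones relative to the rest.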

Comparison and Analysis of SampleCNN Architectures for Audio Classification

SampleCNN is scrutinized further by comparing it with a spectrogram-based CNN and by changing the subsampling operation across three different audio domains; the analysis shows that the excitation in the first layer is sensitive to loudness, an acoustic characteristic that distinguishes different genres of music.

Combining Multi-Scale Features Using Sample-Level Deep Convolutional Neural Networks for Weakly Supervised Sound Event Detection

This paper describes the method submitted to the large-scale weakly supervised sound event detection task for smart cars in the DCASE Challenge 2017, and shows that waveform-based models can be comparable to spectrogram-based models when compared with other DCASE Task 4 submissions.

End-to-end Learning for Music Audio Tagging at Scale

This work studies how waveform-based models outperform spectrogram-based ones in large-scale data scenarios, using datasets of variable size for training, and suggests that music-domain assumptions are relevant when not enough training data are available.

Raw Waveform-based Audio Classification Using Sample-level CNN Architectures

Two types of sample-level deep convolutional neural networks that take raw waveforms as input and use filters with small granularity reach state-of-the-art performance levels for three different categories of sound.

On the Robustness of Deep Convolutional Neural Networks for Music Classification

It is shown that networks can be effective despite relatively large error rates in ground-truth datasets, and subsequently that many commonly used input preprocessing techniques are redundant, except for magnitude compression.

RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification

This study proposes an end-to-end system comprising two deep neural networks, a front-end for utterance-level speaker embedding extraction and a back-end for classification, that achieves state-of-the-art performance among systems without data augmentation.



End-to-end learning for music audio

  • S. Dieleman, B. Schrauwen
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
Although the convolutional neural networks do not outperform a spectrogram-based approach, the networks are able to autonomously discover frequency decompositions from raw audio, as well as phase- and translation-invariant feature representations.

Multi-Level and Multi-Scale Feature Aggregation Using Pretrained Convolutional Neural Networks for Music Auto-Tagging

The experiments show that using the combination of multi-level and multi-scale features is highly effective in music auto-tagging and the proposed method outperforms the previous state-of-the-art methods on the MagnaTagATune dataset and the Million Song Dataset.
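One way to read "multi-level and multi-scale feature aggregation" concretely is to summarize the feature maps of several layers and concatenate the summaries into one clip-level vector. The hypothetical NumPy sketch below uses mean and max pooling over time as the summaries; the layer shapes are invented for illustration, not taken from the paper:

```python
import numpy as np

def aggregate_multilevel(feature_maps):
    """Hypothetical multi-level aggregation: summarize each layer's
    (channels, time) feature map by its mean and max over time, then
    concatenate all summaries into a single clip-level vector."""
    parts = []
    for fmap in feature_maps:
        parts.append(fmap.mean(axis=1))  # average-pooled channel statistics
        parts.append(fmap.max(axis=1))   # max-pooled channel statistics
    return np.concatenate(parts)

rng = np.random.default_rng(1)
# Feature maps from three hypothetical layers of a pretrained CNN:
# deeper layers have more channels and fewer time steps.
maps = [rng.standard_normal((c, t)) for c, t in [(16, 64), (32, 16), (64, 4)]]
vec = aggregate_multilevel(maps)
print(vec.shape)  # (224,): 2 * (16 + 32 + 64) pooled statistics
```

A clip-level classifier (e.g. logistic regression per tag) would then be trained on `vec`, which is how aggregated pretrained features are typically consumed.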

Experimenting with musically motivated convolutional neural networks

  • Jordi Pons, T. Lidy, X. Serra
  • Computer Science
    2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI)
  • 2016
This article explores various architectural choices relevant to music signal classification tasks, in order to begin understanding what the chosen networks are learning, and proposes several musically motivated architectures.

Audio Deepdream: Optimizing raw audio with convolutional networks

This work follows in the footsteps of van den Oord et al. and trains a network to predict embeddings that were themselves the result of a collaborative filtering model, creating a chain of differentiable functions from raw audio to high-level features.

Learning the speech front-end with raw waveform CLDNNs

It is shown that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech.

Applying Topological Persistence in Convolutional Neural Network for Music Audio Signals

This paper proposes to embed the so-called "persistence landscape," a relatively new topological summary of data, into a convolutional neural network (CNN) for dealing with audio signals, and shows that the resulting persistent convolutional neural network (PCNN) model can perform significantly better than state-of-the-art models in prediction accuracy.

Convolutional Neural Networks-based continuous speech recognition using raw speech signal

The studies show that the CNN-based approach achieves better performance than the conventional ANN-based approach with as many parameters, and that the features learned from raw speech by the CNN-based approach can generalize across different databases.

ImageNet classification with deep convolutional neural networks

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

Deep neural networks are easily fooled: High confidence predictions for unrecognizable images

This work takes convolutional neural networks trained to perform well on either the ImageNet or MNIST datasets and, using evolutionary algorithms or gradient ascent, finds images that the DNNs label with high confidence as belonging to each dataset class; the resulting fooling images raise questions about the generality of DNN computer vision.

Visualizing Higher-Layer Features of a Deep Network

This paper contrasts and compares several techniques applied to Stacked Denoising Autoencoders and Deep Belief Networks, trained on several vision datasets, and shows that good qualitative interpretations of the high-level features represented by such models are possible at the unit level.