• Corpus ID: 966171

Raw Waveform-based Audio Classification Using Sample-level CNN Architectures

Jongpil Lee, Taejun Kim, Jiyoung Park, Juhan Nam
Music, speech, and acoustic scene sound are often handled separately in the audio domain because of their different signal characteristics. […] One is a basic model that consists of convolution and pooling layers. The other is an improved model that additionally has residual connections, squeeze-and-excitation modules, and multi-level concatenation. We show that the sample-level models reach state-of-the-art performance levels for the three different categories of sound. Also, we visualize the…
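The two building blocks named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the filter size of 3, the pooling size of 3, the smoothing kernel, and all function names (`conv_relu`, `max_pool`, `basic_block`, `squeeze_excite`) are assumptions chosen for illustration, and residual connections and multi-level concatenation are left out.

```python
import math

def conv_relu(x, kernel):
    """'Same'-padded 1-D convolution (stride 1, odd kernel length) followed by ReLU."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [max(0.0, sum(padded[i + j] * w for j, w in enumerate(kernel)))
            for i in range(len(x))]

def max_pool(x, size):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def basic_block(x, kernel, pool=3):
    """Convolution + ReLU + pooling on a single channel (assumed sizes)."""
    return max_pool(conv_relu(x, kernel), pool)

def squeeze_excite(channels, w1, w2):
    """Rescale each channel by a gate computed from all channel averages.

    channels: list of C feature maps (lists of floats)
    w1: R x C weights of the bottleneck layer, w2: C x R weights of the gate layer
    """
    squeezed = [sum(c) / len(c) for c in channels]            # squeeze: average over time
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]                                   # excitation: sigmoid gates
    return [[g * v for v in c] for g, c in zip(gates, channels)]

# Three stacked blocks shrink a 27-sample waveform to a single value.
wave = [math.sin(0.3 * n) for n in range(27)]
h = wave
for _ in range(3):
    h = basic_block(h, [0.25, 0.5, 0.25])
print(len(h))
```

Because each block divides the length by the pooling size, depth grows with the logarithm of the input length, which is what lets a model with very small filters cover a long raw waveform.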

Comparison and Analysis of SampleCNN Architectures for Audio Classification

SampleCNN is scrutinized further by comparing it with a spectrogram-based CNN and by changing the subsampling operation in three different audio domains; the analysis shows that the excitation in the first layer is sensitive to loudness, an acoustic characteristic that distinguishes different genres of music.

Adaptive Distance-Based Pooling in Convolutional Neural Networks for Audio Event Classification

A new type of pooling layer is proposed that aims to compensate for non-relevant information in audio events by applying an adaptive transformation of the convolutional feature maps along the temporal axis, following a uniform distance subsampling criterion on the learned feature space.

Acoustic Scene Classification With Squeeze-Excitation Residual Networks

Two novel squeeze-excitation blocks are proposed to improve the accuracy of a CNN-based ASC framework based on residual learning; they exceed the performance of the baseline proposed by the DCASE organization by 13 percentage points.

Environmental Sound Classification with Parallel Temporal-Spectral Attention

A novel parallel temporal-spectral attention mechanism for CNN to learn discriminative sound representations is proposed, which enhances the temporal and spectral features by capturing the importance of different time frames and frequency bands.

Learning discriminative and robust time-frequency representations for environmental sound classification

A new method called the time-frequency enhancement block (TFBlock) is proposed, in which temporal attention and frequency attention are employed to enhance the features from relevant frames and frequency bands; it improves classification performance and also exhibits robustness to noise.

A deep convolutionary network for automatic detection of audio events

This paper proposes an event detection architecture based on a convolutional neural network that takes the raw sampled waveform directly as input and automatically learns the most important frequencies for recognizing the sounds of interest.

Artificially Synthesising Data for Audio Classification and Segmentation to Improve Speech and Music Detection in Radio Broadcast

  • S. Venkatesh, D. Moffat, E. Miranda
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
The data synthesis procedure is shown to be a highly effective technique for generating large datasets to train deep neural networks for audio segmentation, and it outperforms state-of-the-art algorithms for music-speech detection.

What Affects the Performance of Convolutional Neural Networks for Audio Event Classification

This paper designs convolutional neural networks for audio event classification (called FPNet). On the environmental sounds dataset ESC-50, FPNet-1D and FPNet-2D achieve classification accuracies of 73.90% and 85.10% respectively, a significant improvement over previous methods.

You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection

This paper presents a novel approach called You Only Hear Once (YOHO), which is inspired by the YOLO algorithm popularly adopted in Computer Vision, and converts the detection of acoustic boundaries into a regression problem instead of frame-based classification.
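To make the "boundaries as regression" idea concrete, here is a minimal sketch of how event boundaries could be turned into per-bin regression targets and back. The target layout (presence, relative start, relative end per time bin), the single-class setting, and the names `encode_events` and `decode_events` are assumptions for illustration; the summary above does not specify the actual encoding.

```python
def encode_events(events, clip_len, n_bins):
    """Turn (start_s, end_s) events of one class into per-bin regression targets.

    Each bin gets [presence, rel_start, rel_end], with the boundaries expressed
    as fractions of the bin length. Assumes at most one event per bin.
    """
    bin_len = clip_len / n_bins
    targets = []
    for b in range(n_bins):
        lo, hi = b * bin_len, (b + 1) * bin_len
        target = [0.0, 0.0, 0.0]
        for start, end in events:
            s, e = max(start, lo), min(end, hi)
            if s < e:  # the event overlaps this bin
                target = [1.0, (s - lo) / bin_len, (e - lo) / bin_len]
                break
        targets.append(target)
    return targets

def decode_events(targets, clip_len, threshold=0.5):
    """Invert encode_events, merging pieces that touch across bin borders."""
    bin_len = clip_len / len(targets)
    events = []
    for b, (presence, rel_start, rel_end) in enumerate(targets):
        if presence < threshold:
            continue
        start, end = (b + rel_start) * bin_len, (b + rel_end) * bin_len
        if events and abs(events[-1][1] - start) < 1e-9:
            events[-1] = (events[-1][0], end)  # continue the previous event
        else:
            events.append((start, end))
    return events

targets = encode_events([(1.5, 4.0)], clip_len=8.0, n_bins=4)
print(targets)
print(decode_events(targets, clip_len=8.0))
```

A network trained against such targets predicts a few numbers per coarse time bin instead of a class decision for every short frame, which is the contrast with frame-based classification drawn in the summary.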



Sample-Level CNN Architectures for Music Auto-Tagging Using Raw Waveforms

This paper improves the 1-D CNN architecture for music auto-tagging by adopting building blocks from state-of-the-art image classification models, ResNets and SENets, and adding multi-level feature aggregation; it also compares different combinations of these modules in building CNN architectures.

Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms

The experiments show how deep architectures with sample-level filters improve the accuracy in music auto-tagging, and they provide results comparable to previous state-of-the-art performances for the MagnaTagATune dataset and the Million Song Dataset.

Multi-Level and Multi-Scale Feature Aggregation Using Pretrained Convolutional Neural Networks for Music Auto-Tagging

The experiments show that using the combination of multi-level and multi-scale features is highly effective in music auto-tagging and the proposed method outperforms the previous state-of-the-art methods on the MagnaTagATune dataset and the Million Song Dataset.

Very deep convolutional neural networks for raw waveforms

This work proposes very deep convolutional neural networks that directly use time-domain waveforms as inputs and are efficient to optimize over the very long sequences needed for processing acoustic waveforms.

End-to-end learning for music audio

  • S. Dieleman, B. Schrauwen
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
Although convolutional neural networks do not outperform a spectrogram-based approach, the networks are able to autonomously discover frequency decompositions from raw audio, as well as phase- and translation-invariant feature representations.

Combining Multi-Scale Features Using Sample-Level Deep Convolutional Neural Networks for Weakly Supervised Sound Event Detection

This paper describes the method submitted to the large-scale weakly supervised sound event detection for smart cars task in the DCASE Challenge 2017, and shows that waveform-based models can be comparable to spectrogram-based models when compared to other DCASE Task 4 submissions.

Learning the speech front-end with raw waveform CLDNNs

It is shown that raw waveform features match the performance of log-mel filterbank energies when used with a state-of-the-art CLDNN acoustic model trained on over 2,000 hours of speech.

Stacked convolutional and recurrent neural networks for bird audio detection

Data augmentation by blocks mixing and domain adaptation using a novel method of test mixing are proposed and evaluated in regard to making the method robust to unseen data.

Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network

A stacked convolutional and recurrent neural network is proposed with two prediction layers in sequence, one for the strong labels followed by one for the weak labels; it achieves the best error rate of 0.84 for strong labels and an F-score of 43.3% for weak labels on the unseen test split.
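For readers unfamiliar with the two figures quoted above, the sketch below computes an error rate and an F-score in the segment-based form commonly used for sound event detection. That the cited results use exactly this definition is an assumption, and `segment_metrics` and the toy labels are hypothetical.

```python
def segment_metrics(reference, predicted):
    """Segment-based error rate and F-score.

    reference, predicted: lists of sets of active class labels, one set per
    fixed-length segment. Per segment, a missed label paired with a false
    alarm counts as one substitution; leftovers are deletions or insertions.
    """
    tp = fp = fn = subs = dels = ins = n_ref = 0
    for ref, pred in zip(reference, predicted):
        seg_fp = len(pred - ref)
        seg_fn = len(ref - pred)
        tp += len(ref & pred)
        fp += seg_fp
        fn += seg_fn
        subs += min(seg_fp, seg_fn)
        dels += max(0, seg_fn - seg_fp)
        ins += max(0, seg_fp - seg_fn)
        n_ref += len(ref)
    error_rate = (subs + dels + ins) / n_ref if n_ref else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    return error_rate, f_score

reference = [{"car"}, {"car", "horn"}, set(), {"siren"}]
predicted = [{"car"}, {"car"}, {"horn"}, {"bus"}]
print(segment_metrics(reference, predicted))  # (0.75, 0.5)
```

Under this definition the error rate is normalized by the number of reference labels, so it can exceed 1.0 when a system produces many insertions; lower is better for the error rate and higher is better for the F-score.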

Experimenting with musically motivated convolutional neural networks

  • Jordi Pons, T. Lidy, X. Serra
  • Computer Science
    2016 14th International Workshop on Content-Based Multimedia Indexing (CBMI)
  • 2016
This article explores various architectural choices of relevance for music signals classification tasks in order to start understanding what the chosen networks are learning and proposes several musically motivated architectures.