A Deep Residual Network for Large-Scale Acoustic Scene Analysis

@inproceedings{Ford2019ADR,
  title={A Deep Residual Network for Large-Scale Acoustic Scene Analysis},
  author={Logan Ford and Hao Tang and François Grondin and James R. Glass},
  booktitle={INTERSPEECH},
  year={2019}
}
Many of the recent advances in audio event detection, particularly on the AudioSet dataset, have focused on improving performance using the released embeddings produced by a pretrained model. In this work, we instead study the task of training a multi-label event classifier directly from the audio recordings of AudioSet. Using the audio recordings, we are not only able to reproduce results from prior work but also confirm the improvements of other proposed additions, such as an attention…
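As a rough illustration of the paper's setting, here is a minimal sketch (in PyTorch, which the original work need not have used) of training a multi-label classifier directly from AudioSet-style recordings: log-mel features, a small convolutional stand-in for the paper's deep residual network, and per-class binary cross-entropy. All shapes and hyperparameters are illustrative assumptions.

# Minimal sketch of direct multi-label training on AudioSet-style clips.
# The tiny backbone stands in for the paper's deep residual network; all
# shapes and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio

NUM_CLASSES = 527  # size of the AudioSet ontology

class SimpleAudioClassifier(nn.Module):
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        # Log-mel front end; 10 s mono clips at 16 kHz are assumed.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=1024, hop_length=320, n_mels=64)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(64, num_classes)

    def forward(self, waveform):                             # (batch, samples)
        x = self.to_db(self.melspec(waveform)).unsqueeze(1)  # (batch, 1, mels, frames)
        return self.head(self.backbone(x))                   # multi-label logits

model = SimpleAudioClassifier()
criterion = nn.BCEWithLogitsLoss()       # one independent sigmoid per class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

waveform = torch.randn(2, 160000)        # dummy batch of two 10 s clips
targets = torch.zeros(2, NUM_CLASSES)    # multi-hot labels
targets[0, 0] = 1.0
targets[1, 5] = 1.0
optimizer.zero_grad()
loss = criterion(model(waveform), targets)
loss.backward()
optimizer.step()
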

Citations

A Global-Local Attention Framework for Weakly Labelled Audio Tagging
  • Helin Wang, Yuexian Zou, Wenwu Wang
  • Computer Science
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
TLDR
A novel two-stream framework for audio tagging is proposed that exploits both the global and local information of sound events and can significantly improve performance across different baseline network architectures.
AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification
TLDR
This work compares several state-of-the-art baselines on the audio tagging (AT) task, studies the performance and efficiency of two major categories of neural architectures, CNN variants and attention-based variants, and closely examines their optimization procedures.
PSLA: Improving Audio Event Classification with Pretraining, Sampling, Labeling, and Aggregation
TLDR
PSLA is presented, a collection of training techniques that can noticeably boost model accuracy, including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, and model aggregation; the resulting model achieves a new state-of-the-art mean average precision on AudioSet.
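Of the PSLA ingredients, mixup-style augmentation is simple to illustrate. The sketch below (PyTorch; the alpha value and tensor shapes are assumptions for illustration) convexly combines random pairs of spectrograms and their multi-hot labels before the usual BCE training step.

import torch

def mixup(features, targets, alpha=0.5):
    """Convexly combine random pairs of examples and their multi-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(features.size(0))
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_x, mixed_y

spectrograms = torch.randn(8, 1, 64, 500)       # batch of log-mel inputs
labels = torch.randint(0, 2, (8, 527)).float()  # multi-hot AudioSet targets
x, y = mixup(spectrograms, labels)              # feed x, y to a BCE training step
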
Large Scale Audiovisual Learning of Sounds with Weakly Labeled Data
TLDR
It is advocated that sound recognition is inherently a multi-modal audiovisual task, in that it is easier to differentiate sounds using both the audio and visual modalities than using either one alone.
Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking
TLDR
This work proposes a simple, model-agnostic method based on a teacher-student framework with loss masking, which first identifies the most critical missing-label candidates and then ignores their contribution during learning; this simple optimization of the training label set improves recognition performance without additional computation.
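A minimal sketch of the loss-masking idea, under assumed shapes and an assumed confidence threshold: binary cross-entropy terms where the annotation says a class is absent but the teacher is confident it is present are treated as missing-label candidates and dropped from the student's loss.

import torch
import torch.nn.functional as F

def masked_bce(student_logits, targets, teacher_probs, threshold=0.5):
    # Candidate missing labels: annotated negative, but the teacher says positive.
    suspect = (targets == 0) & (teacher_probs > threshold)
    per_term = F.binary_cross_entropy_with_logits(
        student_logits, targets, reduction="none")
    per_term = per_term.masked_fill(suspect, 0.0)   # ignore suspect terms
    return per_term.sum() / (~suspect).sum().clamp(min=1)

logits = torch.randn(4, 527)                     # student predictions
targets = torch.randint(0, 2, (4, 527)).float()  # noisy multi-hot annotations
teacher = torch.sigmoid(torch.randn(4, 527))     # teacher's per-class probabilities
loss = masked_bce(logits, targets, teacher)
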
ERANNs: Efficient Residual Audio Neural Networks for Audio Pattern Recognition
TLDR
A new convolutional neural network architecture and a method for improving the inference speed of CNN-based systems for audio pattern recognition (APR) tasks are proposed; the resulting systems are named "Efficient Residual Audio Neural Networks" (ERANNs).
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection
TLDR
HTS-AT is introduced: an audio transformer with a hierarchical structure that reduces model size and training time, further combined with a token-semantic module that maps final outputs into class feature maps, enabling the model to perform audio event detection and localization in time.
Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training
TLDR
This paper presents a novel self-supervised learning method for transformer-based audio models, called masked spectrogram prediction (MaskSpec), to learn powerful audio representations from unlabeled audio data (AudioSet is used in this paper).
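The core idea, hiding patches of the input spectrogram and reconstructing them from the rest, can be sketched as follows. The patch size, mask ratio, and the stand-in convolutional encoder are assumptions for illustration; MaskSpec itself uses a transformer.

import torch
import torch.nn as nn

def random_patch_mask(spec, patch=16, ratio=0.75):
    """Zero out a random subset of non-overlapping patches; return the mask too."""
    b, _, f, t = spec.shape
    mask = torch.rand(b, 1, f // patch, t // patch) < ratio
    mask = mask.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return spec.masked_fill(mask, 0.0), mask.float()

encoder = nn.Sequential(                 # stand-in for a transformer encoder
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 1, 3, padding=1))

spec = torch.randn(2, 1, 64, 256)        # unlabeled log-mel spectrograms
masked, mask = random_patch_mask(spec)
recon = encoder(masked)
loss = ((recon - spec) ** 2 * mask).sum() / mask.sum()  # loss on masked patches only
loss.backward()
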
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
TLDR
This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.
CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification
TLDR
An intriguing interaction is found between the two very different models: CNN and AST models are good teachers for each other. When either is used as the teacher and the other is trained as the student via knowledge distillation, the student's performance noticeably improves and, in many cases, exceeds that of the teacher.
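A minimal sketch of a distillation objective as commonly formulated for multi-label tagging: the student matches both the ground-truth labels and the frozen teacher's sigmoid outputs. The 0.5 weighting and tensor shapes are illustrative assumptions, not the paper's settings.

import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets, lam=0.5):
    hard = F.binary_cross_entropy_with_logits(student_logits, targets)
    soft = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits).detach())
    return (1 - lam) * hard + lam * soft

# e.g. teacher = a trained CNN, student = a transformer, or vice versa
student_logits = torch.randn(4, 527, requires_grad=True)
teacher_logits = torch.randn(4, 527)
targets = torch.randint(0, 2, (4, 527)).float()
distill_loss(student_logits, teacher_logits, targets).backward()
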
...

References

Showing 1-10 of 30 references
Multi-level Attention Model for Weakly Supervised Audio Classification
TLDR
A multi-level attention model consisting of multiple attention modules applied to intermediate neural network layers is proposed; it achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming the single-level attention model and the Google baseline system.
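The pooling mechanism can be sketched as follows, assuming PyTorch and illustrative dimensions: each intermediate feature level gets its own attention-weighted pooling over frames, and the per-level class predictions are averaged.

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.cls = nn.Linear(dim, num_classes)  # per-frame class probabilities
        self.att = nn.Linear(dim, num_classes)  # per-frame attention scores

    def forward(self, h):                        # h: (batch, frames, dim)
        p = torch.sigmoid(self.cls(h))
        w = torch.softmax(self.att(h), dim=1)    # normalize attention over frames
        return (w * p).sum(dim=1)                # (batch, num_classes)

# Two intermediate levels of hidden activations, each pooled separately.
levels = [torch.randn(2, 100, 128), torch.randn(2, 100, 128)]
pools = nn.ModuleList([AttentionPool(128, 527) for _ in levels])
pred = torch.stack([pool(h) for pool, h in zip(pools, levels)]).mean(dim=0)
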
Detection and Classification of Acoustic Scenes and Events: Outcome of the DCASE 2016 Challenge
TLDR
The emergence of deep learning as the most popular classification method is observed, replacing the traditional approaches based on Gaussian mixture models and support vector machines.
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
Unsupervised Learning of Semantic Audio Representations
  • A. Jansen, M. Plakal, R. Saurous
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.
Acoustic Scene Classification Using Parallel Combination of LSTM and CNN
TLDR
This paper proposes a neural network architecture for exploiting sequential information, composed of two separate lower networks and one upper network, referred to as the LSTM layers, the CNN layers, and the connected layers, respectively.
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
TLDR
It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
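A hedged sketch of waveform-level deformations of the kind used to augment environmental sound data, using torchaudio's PitchShift plus additive noise; the specific transforms and parameters are illustrative assumptions, not the paper's exact augmentation set.

import torch
import torchaudio

SAMPLE_RATE = 16000
pitch_up = torchaudio.transforms.PitchShift(SAMPLE_RATE, n_steps=2)

def augment(waveform, noise_level=0.005):
    """Pitch-shift a clip and add low-level background noise; label is unchanged."""
    shifted = pitch_up(waveform)
    return shifted + noise_level * torch.randn_like(shifted)

clip = torch.randn(1, SAMPLE_RATE * 4)  # dummy 4 s mono clip
augmented = augment(clip)               # extra training example, same label
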
Audio Set Classification with Attention Model: A Probabilistic Perspective
This paper investigates Audio Set classification. Audio Set is a large-scale weakly labelled dataset (WLD) of audio clips; in a WLD, only the presence of a label is known, without knowing the…
Weakly Labelled AudioSet Tagging With Attention Neural Networks
TLDR
This work bridges the connection between attention neural networks and multiple instance learning (MIL) methods, and proposes decision-level and feature-level attention neural networks for audio tagging, achieving a state-of-the-art mean average precision.
Robust sound event recognition using convolutional neural networks
TLDR
This work proposes novel features derived from spectrogram energy triggering, allied with the powerful classification capabilities of a convolutional neural network (CNN), which demonstrates excellent performance under noise-corrupted conditions when compared against state-of-the-art approaches on standard evaluation tasks.
Environmental sound classification with convolutional neural networks
  • Karol J. Piczak
  • Computer Science
    2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP)
  • 2015
TLDR
The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.
...