PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation

  • Yuan Gong, Yu-An Chung, James R. Glass
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
Audio tagging is an active research area and has a wide range of applications. Since the release of AudioSet, great progress has been made in advancing model performance, which mostly comes from the development of novel model architectures and attention modules. However, we find that appropriate training techniques are equally important for building audio tagging models with AudioSet, but have not received the attention they deserve. To fill the gap, in this work, we present PSLA, a collection… 

Wav2CLIP: Learning Robust Audio Representations From CLIP

Wav2CLIP is a robust audio representation learning method that distills from Contrastive Language-Image Pre-training (CLIP). It is more efficient to pre-train than competing methods because it does not require learning a visual model in concert with an auditory model.

A Preliminary Study on Environmental Sound Classification Leveraging Large-Scale Pretrained Model and Semi-Supervised Learning

To simulate a low-resource sound classification setting where only limited supervised examples are made available, the notion of transfer learning is instantiated with a recently proposed training algorithm and a data augmentation method to achieve the goal of semi-supervised model training.

Efficient Training of Audio Transformers with Patchout

This work proposes a novel method to optimize and regularize transformers on audio spectrograms, achieving new state-of-the-art performance on AudioSet; the models can be trained on a single consumer-grade GPU.

ERANNs: Efficient Residual Audio Neural Networks for Audio Pattern Recognition

Study of Positional Encoding Approaches for Audio Spectrogram Transformers

This paper studies one component of the AST, the positional encoding, and proposes several variants to improve the performance of ASTs trained from scratch, without ImageNet pretraining.

Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition

  • Yuan Gong, Jingbo Yu, James R. Glass
  • Computer Science, Physics
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
A VocalSound dataset consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs from 3,365 unique subjects is created to support research on building robust and accurate vocal sound recognition.

Learning the Spectrogram Temporal Resolution for Audio Classification

Starting from a high-temporal-resolution spectrogram, such as one with a one-millisecond hop size, it is shown that DiffRes can improve classification accuracy at the same computational complexity.

AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification

This work compares a few state-of-the-art baselines on the AT task, studies the performance and efficiency of two major categories of neural architectures (CNN variants and attention-based variants), and closely examines their optimization procedures.

Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data

A three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet, which achieves comparable Source-to-Distortion Ratio (SDR) performance to current supervised models in both cases.

Ontology-aware Learning and Evaluation for Audio Tagging

A new evaluation metric for audio tagging tasks to overcome the limitation of the conventional mean average precision (mAP) metric, which treats different kinds of sound as independent classes without considering their relations, and a novel loss function that reweights binary cross entropy loss based on the ontology distance.



Weakly Labelled AudioSet Tagging With Attention Neural Networks

This work bridges the connection between attention neural networks and multiple instance learning (MIL) methods, and proposes decision-level and feature-level attention neural networks for audio tagging, which achieve a state-of-the-art mean average precision.

A Deep Residual Network for Large-Scale Acoustic Scene Analysis

The task of training a multi-label event classifier directly from the audio recordings of AudioSet is studied and it is found that the models are able to localize audio events when a finer time resolution is needed.

Rethinking CNN Models for Audio Classification

It is shown that ImageNet-pretrained standard deep CNN models can be used as strong baseline networks for audio classification, and qualitative results of what the CNNs learn from the spectrograms are presented by visualizing the gradients.

Contrastive Learning of General-Purpose Audio Representations

This work builds on top of recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio, and shows that despite its simplicity, this method significantly outperforms previous self-supervised systems.

Pre-Training Audio Representations With Self-Supervision

This work proposes two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip.

Multi-level attention model for weakly supervised audio classification

A multi-level attention model, consisting of multiple attention modules applied on the intermediate neural network layers, that achieves a state-of-the-art mean average precision (mAP) of 0.360, outperforming the single attention model and the Google baseline system.

Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking

This work proposes a simple and model-agnostic method based on a teacher-student framework with loss masking to first identify the most critical missing label candidates, and then ignore their contribution during the learning process, finding that a simple optimisation of the training label set improves recognition performance without additional computation.

Audio Set Classification with Attention Model: A Probabilistic Perspective

This paper investigates Audio Set classification. Audio Set is a large-scale weakly labelled dataset (WLD) of audio clips. In WLD only the presence of a label is known, without knowing the

A Closer Look at Weak Label Learning for Audio Events

This work describes a CNN-based approach for weakly supervised training of audio events, describes important characteristics that naturally arise in weakly supervised learning of sound events, and shows how these aspects of weak labels affect the generalization of models.

Knowledge Transfer from Weakly Labeled Audio Using Convolutional Neural Network for Sound Events and Scenes

This work describes a convolutional neural network (CNN) based framework for sound event detection and classification using weakly labeled audio data and proposes methods to learn representations using this model which can be effectively used for solving the target task.