Unsupervised Contrastive Learning of Sound Event Representations

@article{Fonseca2021UnsupervisedCL,
  title={Unsupervised Contrastive Learning of Sound Event Representations},
  author={Eduardo Fonseca and Diego Ortego and Kevin McGuinness and Noel E. O'Connor and Xavier Serra},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={371-375}
}
  • Eduardo Fonseca, Diego Ortego, X. Serra
  • Published 15 November 2020
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Self-supervised representation learning can mitigate the limitations in recognition tasks with few manually labeled data but abundant unlabeled data—a common scenario in sound event research. In this work, we explore unsupervised contrastive learning as a way to learn sound event representations. To this end, we propose to use the pretext task of contrasting differently augmented views of sound events. The views are computed primarily via mixing of training examples with unrelated backgrounds… 
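To make the pretext task concrete, below is a minimal PyTorch sketch of contrasting two augmented views of the same sound event: each view mixes an example with an unrelated background at low gain (mix-back) and applies a further augmentation, and a SimCLR-style NT-Xent loss pulls the two views together. The encoder, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the pretext task: two "views" of each sound event via mix-back plus a
# simple augmentation, contrasted with an NT-Xent loss. Names and hyperparameters
# are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F


def mix_back(x, background, alpha=0.25):
    """Mix a batch of log-mel patches with unrelated backgrounds at a low level.

    x, background: (batch, freq, time) tensors.
    """
    lam = torch.rand(x.size(0), 1, 1) * alpha          # small random mixing weight
    return (1 - lam) * x + lam * background


def augment(x):
    """Toy second augmentation: SpecAugment-style time masking."""
    x = x.clone()
    t = x.size(-1)
    width = max(1, t // 10)
    start = torch.randint(0, t - width, (1,)).item()
    x[..., start:start + width] = 0.0
    return x


def nt_xent(z1, z2, temperature=0.1):
    """SimCLR-style contrastive loss between two batches of embeddings."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d)
    sim = z @ z.t() / temperature                         # cosine similarities
    n = z1.size(0)
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)


# Usage: encode two differently augmented views of the same clips and contrast them.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(96 * 101, 128))
clips = torch.rand(8, 96, 101)                 # batch of log-mel patches
backgrounds = torch.rand(8, 96, 101)           # unrelated background examples
view1 = augment(mix_back(clips, backgrounds))
view2 = augment(mix_back(clips, backgrounds.roll(1, dims=0)))
loss = nt_xent(encoder(view1), encoder(view2))
```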

Citations

Augmented Contrastive Self-Supervised Learning for Audio Invariant Representations
TLDR
This work proposes an augmented contrastive SSL framework that applies perturbations to unlabeled audio and uses contrastive learning to learn representations robust to those perturbations.
Audio Self-supervised Learning: A Survey
TLDR
This survey summarizes the SSL methods used for audio and speech processing applications, the empirical works that exploit the audio modality in multimodal SSL frameworks, and the existing benchmarks suitable for evaluating SSL in the computer audition domain.
Self-Supervised Learning from Automatically Separated Sound Scenes
TLDR
This paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning and finds that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone.
Multimodal Self-Supervised Learning of General Audio Representations
TLDR
This work demonstrates that its contrastive framework does not require high-resolution images to learn good audio features, and is advantageous on a broad range of non-semantic audio tasks, including speaker identification, keyword spotting, language identification, and music instrument classification.
Enriched Music Representations With Multiple Cross-Modal Contrastive Learning
TLDR
This paper aligns the latent representations obtained from playlist-track interactions, genre metadata, and the tracks’ audio by maximizing the agreement between these modality representations using a contrastive loss.
Learning neural audio features without supervision
TLDR
First, it is shown that pretraining two previously proposed frontends (SincNet and LEAF) on Audioset drastically improves linear-probe performance over mel-filterbanks, suggesting that learnable time-frequency representations can benefit self-supervised pre-training even more than supervised training.
Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation
TLDR
This paper learns audio representations using the input itself as supervision, via a pretext task of auto-encoding masked spectrogram patches, Masked Spectrogram Modeling (MSM, a variant of Masked Image Modeling applied to audio spectrograms), implemented with Masked Autoencoders (MAE), an image self-supervised learning method.
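As a rough illustration of the masking step behind MSM, the sketch below splits a spectrogram into non-overlapping patches and hides a large random subset; the patch size, the 75% masking ratio, and the helper names are assumptions, not the paper's code.

```python
# Rough sketch of spectrogram patch masking for masked-autoencoder pretraining:
# split a spectrogram into patches, keep a small random subset visible, and
# reconstruct the hidden patches. Shapes and the 75% ratio are assumptions.
import torch


def patchify(spec, patch=16):
    """(freq, time) spectrogram -> (num_patches, patch*patch) tokens."""
    f, t = spec.shape
    spec = spec[: f - f % patch, : t - t % patch]        # trim to a multiple of patch
    patches = spec.unfold(0, patch, patch).unfold(1, patch, patch)
    return patches.reshape(-1, patch * patch)


def random_mask(tokens, mask_ratio=0.75):
    """Return visible tokens plus indices of kept and masked (target) patches."""
    n = tokens.size(0)
    perm = torch.randperm(n)
    n_keep = int(n * (1 - mask_ratio))
    keep, masked = perm[:n_keep], perm[n_keep:]
    return tokens[keep], keep, masked


spec = torch.randn(128, 512)                 # e.g. 128 mel bins x 512 frames
tokens = patchify(spec)
visible, keep_idx, masked_idx = random_mask(tokens)
# An encoder sees only `visible`; a light decoder reconstructs tokens[masked_idx],
# and the loss is typically MSE computed on those masked patches only.
```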
Self-Trained Audio Tagging and Sound Event Detection in Domestic Environments
TLDR
This paper uses a forward-backward convolutional recurrent neural network (FBCRNN) for tagging and pseudo labeling, followed by tag-conditioned sound event detection (SED) models trained using strong pseudo labels provided by the FBCRNN, and introduces a strong label loss in the FBCRNN objective to take advantage of the strongly labeled synthetic data during training.
FSD50K: An Open Dataset of Human-Labeled Sound Events
TLDR
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 hours of audio manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.
ATST: Audio Representation Learning with Teacher-Student Transformer
TLDR
This work addresses the problem of segment-level general audio SSL and proposes a new transformer-based teacher-student SSL model, named ATST, which achieves new state-of-the-art results on almost all of the downstream tasks.

References

SHOWING 1-10 OF 30 REFERENCES
Model-Agnostic Approaches To Handling Noisy Labels When Training Sound Event Classifiers
TLDR
This work evaluates simple and efficient model-agnostic approaches to handling noisy labels when training sound event classifiers, namely label smoothing regularization, mixup, and noise-robust loss functions, which can be easily incorporated into existing deep learning pipelines without the need for network modifications or extra resources.
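Of the techniques listed above, mixup is easy to illustrate: blend random pairs of examples and their label vectors so that any individual corrupted label weighs less. The sketch below is a generic PyTorch version with assumed shapes and an assumed Beta(0.2, 0.2) mixing distribution, not the paper's exact setup.

```python
# Illustrative mixup sketch: blend random pairs of inputs and their label vectors,
# which tends to soften the effect of individual corrupted labels.
import torch


def mixup_batch(x, y, alpha=0.2):
    """x: (batch, ...) inputs, y: (batch, num_classes) one-hot/soft labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix


# Usage with a batch of log-mel patches and one-hot targets:
x = torch.rand(16, 96, 101)
y = torch.nn.functional.one_hot(torch.randint(0, 10, (16,)), 10).float()
x_mix, y_mix = mixup_batch(x, y)
# Train with a loss that accepts soft targets, e.g. BCE or soft cross-entropy.
```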
Learning Sound Event Classifiers from Web Audio with Noisy Labels
TLDR
Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in the presence of corrupted labels.
Unsupervised Learning of Semantic Audio Representations
  • A. Jansen, M. Plakal, R. Saurous
  • Computer Science
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.
Representation Learning with Contrastive Predictive Coding
TLDR
This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
Tricycle: Audio Representation Learning from Sensor Network Data Using Self-Supervision
TLDR
A model for learning audio representations by predicting the long-term, cyclic temporal structure in audio data collected from an urban acoustic sensor network is presented and the utility of the learned audio representation in an urban sound event detection task with limited labeled data is demonstrated.
Multi-label Few-shot Learning for Sound Event Recognition
TLDR
A One-vs.-Rest episode selection strategy is proposed to mitigate the complexity of forming an episode, and the strategy is applied to the multi-label few-shot problem.
Pre-Training Audio Representations With Self-Supervision
TLDR
This work proposes two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip.
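A minimal sketch of the TemporalGap idea follows: sample two short windows from the same clip and use their normalized time offset as the regression target (Audio2Vec would instead reconstruct a slice from its neighbors). Window length and the normalization are assumptions.

```python
# Sketch of the TemporalGap pretext task: cut two windows from one waveform and
# regress the (normalized) time gap between them. Sizes are assumptions.
import torch


def temporal_gap_example(clip, win=4000):
    """Return two windows from one waveform and their normalized time gap."""
    n = clip.size(-1)
    s1 = torch.randint(0, n - win, (1,)).item()
    s2 = torch.randint(0, n - win, (1,)).item()
    gap = torch.tensor([abs(s1 - s2) / n], dtype=torch.float32)   # target in [0, 1)
    return clip[..., s1:s1 + win], clip[..., s2:s2 + win], gap


clip = torch.randn(1, 16000)                  # one second of 16 kHz audio
w1, w2, gap = temporal_gap_example(clip)
# A small network embeds w1 and w2 and is trained with, e.g., MSE against `gap`.
```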
Unsupervised Feature Learning via Non-parametric Instance Discrimination
TLDR
This work formulates instance-level discrimination as a non-parametric classification problem and uses noise-contrastive estimation to tackle the computational challenges imposed by the large number of instance classes.
Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
TLDR
WavAugment is introduced, a time-domain data augmentation library adapted and optimized for the specificities of CPC (raw waveform input, contrastive loss, past-versus-future structure); applying augmentation only to the segments from which the CPC prediction is performed is found to yield better results.
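For intuition, the sketch below shows generic time-domain augmentations (additive noise at a target SNR and a short random dropout span) written in plain PyTorch; this is not WavAugment's API, only an illustration of perturbing raw waveforms before a contrastive model.

```python
# Generic time-domain waveform augmentations, written in plain PyTorch as an
# illustration only (not WavAugment's API).
import torch


def add_noise(wave, snr_db=15.0):
    """Add white noise at a target signal-to-noise ratio (in dB)."""
    noise = torch.randn_like(wave)
    sig_pow = wave.pow(2).mean()
    noise_pow = noise.pow(2).mean()
    scale = torch.sqrt(sig_pow / (noise_pow * 10 ** (snr_db / 10)))
    return wave + scale * noise


def random_time_dropout(wave, max_frac=0.1):
    """Zero out a short random span of samples."""
    n = wave.size(-1)
    width = int(n * torch.rand(1).item() * max_frac)
    if width == 0:
        return wave
    start = torch.randint(0, n - width, (1,)).item()
    out = wave.clone()
    out[..., start:start + width] = 0.0
    return out


wave = torch.randn(1, 16000)                  # one second of 16 kHz audio
augmented = random_time_dropout(add_noise(wave))
```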
Metric Learning with Background Noise Class for Few-Shot Detection of Rare Sound Events
TLDR
This paper aims to achieve few-shot detection of rare sound events from query sequences that contain not only the target events but also other events and background noise, and proposes metric learning with a background noise class for few-shot detection.