Unsupervised Contrastive Learning of Sound Event Representations
@inproceedings{Fonseca2021UnsupervisedCL,
  title     = {Unsupervised Contrastive Learning of Sound Event Representations},
  author    = {Eduardo Fonseca and Diego Ortego and Kevin McGuinness and Noel E. O'Connor and Xavier Serra},
  booktitle = {ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year      = {2021},
  pages     = {371-375}
}
Self-supervised representation learning can mitigate the limitations in recognition tasks with few manually labeled data but abundant unlabeled data—a common scenario in sound event research. In this work, we explore unsupervised contrastive learning as a way to learn sound event representations. To this end, we propose to use the pretext task of contrasting differently augmented views of sound events. The views are computed primarily via mixing of training examples with unrelated backgrounds…
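The pretext task described above is typically trained with a contrastive objective that pulls together embeddings of two augmented views of the same clip and pushes apart views of different clips. As an illustrative sketch only (not the authors' exact implementation), a standard NT-Xent-style loss over a batch of paired views can be written as:

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """NT-Xent contrastive loss over a batch of paired views.

    z_a, z_b: (N, D) embeddings of two augmented views of the same N clips.
    Matching rows are positives; every other row in the batch is a negative.
    """
    # L2-normalize so the dot product is cosine similarity
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    z = np.concatenate([z_a, z_b], axis=0)          # (2N, D)
    sim = z @ z.T / temperature                     # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                  # exclude self-similarity
    n = z_a.shape[0]
    # positive for row i is row i+n (and vice versa)
    targets = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), targets].mean()
```

The loss decreases as the two views of each clip become more similar relative to the rest of the batch, which is what makes augmentations such as mixing with unrelated backgrounds effective negatives-free supervision.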
19 Citations
Augmented Contrastive Self-Supervised Learning for Audio Invariant Representations
- Computer Science, ArXiv
- 2021
This work proposes an augmented contrastive SSL framework that learns invariant representations from unlabeled data, using contrastive learning to make the representations robust to a range of audio perturbations.
Audio Self-supervised Learning: A Survey
- Computer Science, ArXiv
- 2022
An overview of the SSL methods used for audio and speech processing applications, the empirical works that exploit the audio modality in multimodal SSL frameworks, and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain are summarized.
Self-Supervised Learning from Automatically Separated Sound Scenes
- Computer Science, 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2021
This paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning and finds that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone.
Multimodal Self-Supervised Learning of General Audio Representations
- Computer Science, ArXiv
- 2021
This work demonstrates that their contrastive framework does not require high resolution images to learn good audio features, and is advantageous on a broad range of non-semantic audio tasks, including speaker identification, keyword spotting, language identification, and music instrument classification.
Enriched Music Representations With Multiple Cross-Modal Contrastive Learning
- Computer Science, IEEE Signal Processing Letters
- 2021
This paper aligns the latent representations obtained from playlist-track interactions, genre metadata, and the tracks' audio, by maximizing the agreement between these modality representations using a contrastive loss.
Learning neural audio features without supervision
- Computer Science, ArXiv
- 2022
First, it is shown that pretraining two previously proposed frontends (SincNet and LEAF) on AudioSet drastically improves linear-probe performance over mel-filterbanks, suggesting that learnable time-frequency representations can benefit self-supervised pre-training even more than supervised training.
Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation
- Computer Science, ArXiv
- 2022
This paper learns audio representations using the input itself as supervision, via the pretext task of auto-encoding masked spectrogram patches: Masked Spectrogram Modeling (MSM), a variant of Masked Image Modeling applied to audio spectrograms, implemented with Masked Autoencoders (MAE), an image self-supervised learning method.
Self-Trained Audio Tagging and Sound Event Detection in Domestic Environments
- Computer Science, DCASE
- 2021
This paper uses a forward-backward convolutional recurrent neural network (FBCRNN) for tagging and pseudo-labeling, followed by tag-conditioned sound event detection (SED) models trained with strong pseudo labels provided by the FBCRNN, and introduces a strong label loss in the FBCRNN objective to take advantage of the strongly labeled synthetic data during training.
FSD50K: An Open Dataset of Human-Labeled Sound Events
- Computer Science, IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2022
FSD50K is introduced, an open dataset containing over 51k audio clips totalling over 100 hours of audio, manually labeled using 200 classes drawn from the AudioSet Ontology, to provide an alternative benchmark dataset and thus foster sound event recognition (SER) research.
ATST: Audio Representation Learning with Teacher-Student Transformer
- Computer Science, ArXiv
- 2022
This work addresses the problem of segment-level general audio SSL, and proposes a new transformer-based teacher-student SSL model, named ATST, which achieves the new state-of-the-art results on almost all of the downstream tasks.
References
Showing 10 of 30 references
Model-Agnostic Approaches To Handling Noisy Labels When Training Sound Event Classifiers
- Computer Science, 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2019
This work evaluates simple and efficient model-agnostic approaches to handling noisy labels when training sound event classifiers, namely label smoothing regularization, mixup, and noise-robust loss functions, which can be easily incorporated into existing deep learning pipelines without network modifications or extra resources.
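Of the techniques named above, mixup is the most self-contained: it trains on convex combinations of example pairs and their labels. A minimal sketch (illustrative only, not the paper's implementation):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix two training examples and their one-hot labels.

    A mixing coefficient lam is drawn from Beta(alpha, alpha); the model is
    then trained on the interpolated input with the interpolated soft label.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x_mixed = lam * x1 + (1 - lam) * x2
    y_mixed = lam * y1 + (1 - lam) * y2
    return x_mixed, y_mixed
```

Because the mixed label stays a valid probability distribution, the same cross-entropy loss can be used unchanged, which is what makes the approach model-agnostic.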
Learning Sound Event Classifiers from Web Audio with Noisy Labels
- Computer Science, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2019
Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in presence of corrupted labels.
Unsupervised Learning of Semantic Audio Representations
- Computer Science, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2018
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.
Representation Learning with Contrastive Predictive Coding
- Computer Science, ArXiv
- 2018
This work proposes a universal unsupervised learning approach to extract useful representations from high-dimensional data, which it calls Contrastive Predictive Coding, and demonstrates that the approach is able to learn useful representations achieving strong performance on four distinct domains: speech, images, text and reinforcement learning in 3D environments.
Tricycle: Audio Representation Learning from Sensor Network Data Using Self-Supervision
- Computer Science, 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
- 2019
A model for learning audio representations by predicting the long-term, cyclic temporal structure in audio data collected from an urban acoustic sensor network is presented and the utility of the learned audio representation in an urban sound event detection task with limited labeled data is demonstrated.
Multi-label Few-shot Learning for Sound Event Recognition
- Computer Science, 2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP)
- 2019
A one-vs.-rest episode selection strategy is proposed to mitigate the complexity of forming episodes, and the strategy is applied to the multi-label few-shot problem.
Pre-Training Audio Representations With Self-Supervision
- Computer Science, IEEE Signal Processing Letters
- 2020
This work proposes two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and TemporalGap, which estimates the distance between two short audio segments extracted at random from the same audio clip.
Unsupervised Feature Learning via Non-parametric Instance Discrimination
- Computer Science, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
This work forms this intuition as a non-parametric classification problem at the instance-level, and uses noise-contrastive estimation to tackle the computational challenges imposed by the large number of instance classes.
Data Augmenting Contrastive Learning of Speech Representations in the Time Domain
- Computer Science, 2021 IEEE Spoken Language Technology Workshop (SLT)
- 2021
WavAugment is introduced, a time-domain data augmentation library adapted and optimized for the specificities of CPC (raw waveform input, contrastive loss, past-versus-future structure), and it is found that applying augmentation only to the segments from which the CPC prediction is performed yields better results.
Metric Learning with Background Noise Class for Few-Shot Detection of Rare Sound Events
- Computer Science, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2020
This paper aims to achieve few-shot detection of rare sound events from query sequences that contain not only the target events but also other events and background noise, and proposes metric learning with a background noise class for the few-shot detection.