• Corpus ID: 256416560

The Efficacy of Self-Supervised Speech Models for Audio Representations

Tung-Yu Wu, Chen-An Li, Tzu-Han Lin, Tsung-Yuan Hsu, Hung-yi Lee
Self-supervised learning (SSL) speech models, which can serve as powerful upstream models to extract meaningful speech representations, have achieved unprecedented success in speech representation learning. However, their effectiveness on non-speech datasets is relatively less explored. In this work, we propose an ensemble framework, with a combination of ensemble techniques, to fuse SSL speech models’ embeddings. Extensive experiments on speech and non-speech audio datasets are conducted to…

Wav2CLIP: Learning Robust Audio Representations from CLIP

This work proposes Wav2CLIP, a robust audio representation learning method distilled from Contrastive Language-Image Pre-training (CLIP), which is more efficient to pre-train than competing methods because it does not require learning a visual model in concert with an auditory model.

Efficient Training of Audio Transformers with Patchout

This work proposes a novel method to optimize and regularize transformers on audio spectrograms; the resulting models achieve new state-of-the-art performance on AudioSet and can be trained on a single consumer-grade GPU.

Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders

This work details a powerful new WaveNet-style autoencoder model that conditions an autoregressive decoder on temporal codes learned from the raw audio waveform, and introduces NSynth, a large-scale, high-quality dataset of musical notes that is an order of magnitude larger than comparable public datasets.

ESC: Dataset for Environmental Sound Classification

This work presents a new annotated collection of 2,000 short clips comprising 50 classes of common sound events, together with a unified compilation of 250,000 unlabeled auditory excerpts extracted from recordings available through the Freesound project.

SERAB: A Multi-Lingual Benchmark for Speech Emotion Recognition

The proposed Speech Emotion Recognition Adaptation Benchmark (SERAB) is a framework for evaluating the performance and generalization capacity of different approaches for utterance-level SER, and a selection of standard hand-crafted feature sets and state-of-the-art DNN representations are evaluated.

VoxLingua107: A Dataset for Spoken Language Recognition

This paper generates semi-random search phrases from language-specific Wikipedia data that are then used to retrieve videos from YouTube for 107 languages and uses the data to build language recognition models for several spoken language identification tasks.

Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning

This work introduces Bootstrap Your Own Latent (BYOL), a new approach to self-supervised image representation learning that performs on par with or better than the current state of the art on both transfer and semi-supervised benchmarks.

PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition

This paper proposes pretrained audio neural networks (PANNs) trained on the large-scale AudioSet dataset, and investigates the performance and computational complexity of PANNs modeled by a variety of convolutional neural networks.

Vocal Imitation Set: a dataset of vocally imitated sound events using the AudioSet ontology

This work presents Vocal Imitation Set, a new vocal imitation dataset containing 11,242 crowd-sourced vocal imitations of 302 sound event classes in the AudioSet sound event ontology, and provides an example of using the dataset to measure how well the existing state-of-the-art in QBV search performs on fine-grained search.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient, and its effectiveness is demonstrated on scaling up MobileNets and ResNet.