Corpus ID: 249191234

HEAR: Holistic Evaluation of Audio Representations

Authors: Joseph P. Turian, Jordie Shier, Humair Raj Khan, Bhiksha Raj, Björn Schuller, C. Steinmetz, Colin Malloy, George Tzanetakis, Gissel Velarde, Kirk McNally, Max Henry, Nicolas Pinto, Camille Noufi, Christian Clough, Dorien Herremans, Eduardo Fonseca, Jesse Engel, Justin Salamon, Philippe Esling, Pranay Manocha, Shinji Watanabe, Zeyu Jin, Yonatan Bisk
What audio embedding approach generalizes best to a wide range of downstream tasks across a variety of everyday domains without fine-tuning? The aim of the HEAR benchmark is to develop a general-purpose audio representation that provides a strong basis for learning in a wide variety of tasks and scenarios. HEAR evaluates audio representations using a benchmark suite across a variety of domains, including speech, environmental sound, and music. HEAR was launched as a NeurIPS 2021 shared challenge… 
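The evaluation protocol described above (frozen embeddings, shallow downstream classifiers, no fine-tuning) can be sketched in a few lines. This is a toy illustration, not the HEAR evaluation code: the `embed` function is a hypothetical stand-in for a pretrained audio model, the synthetic "tasks" are two classes of noisy tones, and the downstream classifier is a simple nearest-centroid probe.

```python
import numpy as np

# Hypothetical stand-in for a frozen, pretrained audio embedding. In the
# HEAR setup, the real model is kept fixed and only the shallow downstream
# classifier is trained on its outputs.
def embed(audio: np.ndarray) -> np.ndarray:
    # Toy "embedding": mean magnitudes of 8 FFT bands (not a real model).
    spectrum = np.abs(np.fft.rfft(audio))
    return np.array([band.mean() for band in np.array_split(spectrum, 8)])

rng = np.random.default_rng(0)

def make_clip(freq: float) -> np.ndarray:
    # Synthetic 1024-sample clip: a tone at `freq` cycles plus light noise.
    t = np.arange(1024) / 1024
    return np.sin(2 * np.pi * freq * t) + 0.1 * rng.standard_normal(1024)

# Two synthetic classes: low-frequency vs. high-frequency tones.
X = np.stack([embed(make_clip(f)) for f in [10] * 20 + [200] * 20])
y = np.array([0] * 20 + [1] * 20)

# Shallow downstream classifier on frozen embeddings: nearest class centroid.
centroids = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
accuracy = (np.argmin(dists, axis=1) == y).mean()
```

A strong general-purpose representation, in this framing, is one for which such cheap probes work well across many different downstream tasks at once.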



Towards Learning Universal Audio Representations
A holistic audio representation evaluation suite (HARES), spanning 12 downstream tasks across audio domains, is introduced, along with a novel normalizer-free Slowfast NFNet that achieves state-of-the-art performance across all domains.
BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
Inspired by recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning…
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.
LEAF: A Learnable Frontend for Audio Classification
This work introduces a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement of mel-filterbanks, and outperforms the current state-of-the-art learnable frontend on Audioset, with orders of magnitude fewer parameters.
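For context on the blurb above: the component LEAF replaces is the conventional fixed mel-filterbank frontend. A minimal sketch of that fixed baseline, assuming illustrative parameter choices (8 filters, 512-point FFT, 16 kHz sample rate; real frontends typically use 40+ filters):

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale conversion.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=8, n_fft=512, sr=16000):
    # Triangular filters whose centers are evenly spaced on the mel scale;
    # these weights are FIXED here, whereas LEAF makes the frontend learnable.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, center, hi = bins[i - 1], bins[i], bins[i + 1]
        if center > lo:   # rising slope of the triangle
            fb[i - 1, lo:center] = (np.arange(lo, center) - lo) / (center - lo)
        if hi > center:   # falling slope of the triangle
            fb[i - 1, center:hi] = (hi - np.arange(center, hi)) / (hi - center)
    return fb

fb = mel_filterbank()
frame = np.random.default_rng(0).standard_normal(512)
spectrum = np.abs(np.fft.rfft(frame))
features = np.log(fb @ spectrum + 1e-6)  # log-mel features for one frame
```

A learnable frontend keeps this overall shape (filtering, pooling, compression) but parameterizes the pieces so they can be trained end to end with the classifier.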
Contrastive Learning of General-Purpose Audio Representations
This work builds on top of recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio, and shows that despite its simplicity, this method significantly outperforms previous self-supervised systems.
Multimodal Self-Supervised Learning of General Audio Representations
This work demonstrates that their contrastive framework does not require high resolution images to learn good audio features, and is advantageous on a broad range of non-semantic audio tasks, including speaker identification, keyword spotting, language identification, and music instrument classification.
A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations…
Learning Audio Representations with MLPs
In this paper, we propose an efficient MLP-based approach for learning audio representations, namely timestamp and scene-level audio embeddings. We use an encoder consisting of sequentially stacked…
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level attributes such as emotional cues.
BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the importance of using different pre-training datasets to develop general-purpose audio representations.