Zero-Shot Audio Classification Via Semantic Embeddings

@article{Xie2021ZeroShotAC,
  title={Zero-Shot Audio Classification Via Semantic Embeddings},
  author={Huang Xie and Tuomas Virtanen},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021},
  volume={29},
  pages={1233-1242}
}
  • Huang Xie, Tuomas Virtanen
  • Published 24 November 2020
  • Computer Science
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information. We employ a bilinear compatibility framework to learn an acoustic-semantic projection between intermediate-level representations of audio instances and sound… 
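In this framework, the compatibility between an acoustic embedding x and a semantic class embedding y is a bilinear form F(x, y) = x^T W y, where W is the learned acoustic-semantic projection. A minimal numpy sketch of the zero-shot inference step follows; the dimensions and the random W (which in practice would be learned on seen classes) are illustrative assumptions, not values from the paper.

import numpy as np

def bilinear_compatibility(audio_emb, class_embs, W):
    """Score each sound class for one clip via F(x, y) = x^T W y."""
    # audio_emb:  (d_a,)            acoustic embedding of the clip
    # class_embs: (n_classes, d_s)  semantic embeddings of the class labels
    # W:          (d_a, d_s)        learned acoustic-semantic projection
    return class_embs @ (W.T @ audio_emb)  # (n_classes,) compatibility scores

# Zero-shot prediction: choose the unseen class with the highest compatibility.
rng = np.random.default_rng(0)
d_a, d_s, n_unseen = 128, 300, 5       # hypothetical audio/semantic dimensions
W = rng.normal(size=(d_a, d_s))        # stands in for a matrix trained on seen classes
x = rng.normal(size=d_a)               # acoustic embedding of a test clip
Y = rng.normal(size=(n_unseen, d_s))   # label embeddings of unseen classes
predicted_class = int(np.argmax(bilinear_compatibility(x, Y, W)))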

Zero-Shot Audio Classification with Factored Linear and Nonlinear Acoustic-Semantic Projections

Experimental results show that the proposed projection methods are effective for improving classification performance of zero-shot learning in audio classification.
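One plausible reading of "factored" here is replacing the full bilinear matrix W with a low-rank product W = U V^T, with the nonlinear variant inserting a nonlinearity on each side before matching; the sketch below illustrates both under that assumption (the rank k and the tanh choice are hypothetical, not taken from the paper).

import numpy as np

rng = np.random.default_rng(0)
d_a, d_s, k = 128, 300, 32   # k: assumed rank of the factorization

# Factored linear projection: W = U @ V.T cuts the parameter count
# from d_a * d_s down to k * (d_a + d_s).
U = rng.normal(size=(d_a, k))
V = rng.normal(size=(d_s, k))

def factored_linear_score(x, y):
    # F(x, y) = (U^T x) . (V^T y) == x^T (U V^T) y
    return (U.T @ x) @ (V.T @ y)

def factored_nonlinear_score(x, y):
    # Same factors, with an elementwise nonlinearity applied to each side.
    return np.tanh(U.T @ x) @ np.tanh(V.T @ y)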

Wikitag: Wikipedia-Based Knowledge Embeddings Towards Improved Acoustic Event Classification

To the author’s knowledge, this is the first work in the AEC domain on building large-scale label representations by leveraging Wikipedia data in a systematic fashion.

Audioclip: Extending Clip to Image, Text and Audio

  • A. Guzhov, Federico Raue, Jörn Hees, A. Dengel
  • Computer Science
    ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2022
An extension of the CLIP model that handles audio in addition to text and images, achieving new state-of-the-art results on the Environmental Sound Classification (ESC) task and outperforming other approaches with accuracies of 97.15% on ESC-50 and 90.07% on UrbanSound8K.
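For reference, the CLIP-style zero-shot recipe that AudioCLIP inherits scores each candidate class by the cosine similarity between the audio embedding and the text embedding of that class's prompt in the shared space. A minimal sketch assuming the encoder outputs are already computed (the 0.07 temperature is a common CLIP default, not necessarily AudioCLIP's value):

import numpy as np

def l2_normalize(v, axis=-1):
    return v / np.linalg.norm(v, axis=axis, keepdims=True)

def clip_style_zero_shot(audio_emb, text_embs, temperature=0.07):
    """Return a probability over class prompts for one audio clip."""
    a = l2_normalize(audio_emb)        # (d,)   audio-tower output
    t = l2_normalize(text_embs)        # (n, d) text-tower outputs, one per class
    logits = (t @ a) / temperature     # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()         # softmax over the candidate classes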

Exploring Universal Sentence Encoders for Zero-shot Text Classification

Investigation revealed that USE struggles on datasets with many labels that have high semantic overlap, whereas topic-based classification handles the same data well.

Data Augmentation and Deep Learning Methods in Sound Classification: A Systematic Review

The aim of this systematic literature review (SLR) is to identify and critically evaluate current research advancements with respect to small data and the use of data augmentation methods to increase the amount of available training data.

Wav2vec2-based Paralinguistic Systems to Recognise Vocalised Emotions and Stuttering

The experimental results demonstrated that wav2vec2 is an excellent tool for detecting the emotions behind vocalisations and recognising different types of stuttering, and that performance can be further improved by ensembling it with other models.

Searching For Loops And Sound Samples With Feature Learning

An active learning system designed to categorize sounds as samples or loops is described, and the use of neural-network feature extraction for retrieving subjectively interesting sounds from electronic music tracks is evaluated.

MuLan: A Joint Embedding of Music Audio and Natural Language

The first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural-language music descriptions: MuLan takes the form of a two-tower, joint audio-text embedding model trained on 44 million music recordings and weakly associated, free-form text annotations.
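Two-tower models of this kind are typically trained with a symmetric contrastive objective over matched audio-text pairs in a batch. A minimal numpy sketch under that assumption; the temperature and the plain in-batch negatives are illustrative choices, not MuLan's actual training configuration.

import numpy as np

def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.1):
    """InfoNCE-style loss: matched pairs sit on the diagonal of the score matrix."""
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature   # (B, B) pairwise cosine similarities
    diag = np.arange(len(a))
    # Audio-to-text direction: each row should peak at its own column.
    log_p_at = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Text-to-audio direction: each column should peak at its own row.
    log_p_ta = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -(log_p_at[diag, diag].mean() + log_p_ta[diag, diag].mean()) / 2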

Improved Zero-Shot Audio Tagging & Classification with Patchout Spectrogram Transformers

This study sets out to investigate the effectiveness of self-attention-based audio embedding architectures for ZS learning, and compares the very recent patchout spectrogram transformer with two classic convolutional architectures.

Zero-Shot Audio Classification using Image Embeddings

It is demonstrated that image embeddings can be used as semantic information for zero-shot audio classification, that classification performance is highly sensitive to the semantic relation between test and training classes, and that textual and image embeddings can approach the performance of semantic acoustic embeddings when the seen and unseen classes are semantically similar.

References

Showing 1–10 of 39 references

Zero-Shot Audio Classification Based On Class Label Embeddings

  • Huang Xie, Tuomas Virtanen
  • Computer Science
    2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • 2019
An audio classification system built on the bilinear model, which takes audio feature embeddings and semantic class labels as input and measures the compatibility between an audio feature embedding and a class label embedding.
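Compatibility models like this are commonly trained with a pairwise ranking objective that pushes the score of the correct label above every wrong label by a margin (as in ALE/DeViSE); the sketch below assumes that setup, with the margin and plain averaging as illustrative choices.

import numpy as np

def hinge_ranking_loss(x, y_true, Y_wrong, W, margin=1.0):
    """Average hinge loss over wrong labels for one training clip."""
    # x:       (d_a,)           audio embedding of the clip
    # y_true:  (d_s,)           embedding of the correct class label
    # Y_wrong: (n_wrong, d_s)   embeddings of the other (wrong) class labels
    s_true = x @ W @ y_true
    s_wrong = x @ W @ Y_wrong.T                     # (n_wrong,) wrong-label scores
    return np.maximum(0.0, margin - s_true + s_wrong).mean()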

Zero-shot Learning for Audio-based Music Classification and Tagging

This work investigates zero-shot learning in the music domain and organizes two different setups of side information: human-labeled attribute information based on the Free Music Archive and OpenMIC-2018 datasets, and general word semantic information from the Million Song Dataset and Last.fm tag annotations.

SoundSemantics: Exploiting Semantic Knowledge in Text for Embedded Acoustic Event Classification

  • Md Tamzeed Islam, S. Nirjon
  • Computer Science
    2019 18th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN)
  • 2019
A generic mobile application for audio event detection that is able to recognize all types of sounds (at varying levels of accuracy, depending on the number of classes that do not have any training examples), which is not achievable by any existing audio classifier.

Zero-Shot Learning via Semantic Similarity Embedding

In this paper we consider a version of the zero-shot learning problem where seen class source and target domain data are provided. The goal during test time is to accurately predict the class label of an unseen target-domain instance.

Audio Set: An ontology and human-labeled dataset for audio events

The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.

Synthesized Classifiers for Zero-Shot Learning

This work introduces a set of "phantom" object classes whose coordinates live in both the semantic space and the model space and demonstrates superior accuracy of this approach over the state of the art on four benchmark datasets for zero-shot learning.

CNN architectures for large-scale audio classification

This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels and investigates varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on this audio classification task, and that larger training and label sets help up to a point.

Label-Embedding for Image Classification

This work proposes to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors, and introduces a function that measures the compatibility between an image and a label embedding.

Latent Embeddings for Zero-Shot Classification

A novel latent embedding model for learning a compatibility function between image and class embeddings, in the context of zero-shot classification, that improves the state-of-the-art for various class embeddings consistently on three challenging publicly available datasets for the zero-shot setting.

Recent Advances in Zero-Shot Recognition: Toward Data-Efficient Understanding of Visual Content

This article provides a comprehensive review of existing zero-shot recognition techniques covering various aspects ranging from representations of models, data sets, and evaluation settings and highlights the limitations of existing approaches.