Zero-Shot Audio Classification Via Semantic Embeddings

@article{Xie2021ZeroShotAC,
  title={Zero-Shot Audio Classification Via Semantic Embeddings},
  author={Huang Xie and Tuomas Virtanen},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  year={2021},
  volume={29},
  pages={1233--1242}
}
  • Huang Xie, T. Virtanen
  • Published 24 November 2020
  • Computer Science
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information. We employ a bilinear compatibility framework to learn an acoustic-semantic projection between intermediate-level representations of audio instances and sound… 
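The bilinear compatibility idea named in the abstract can be illustrated with a minimal sketch. It assumes a score of the form F(x, y) = xᵀ W y between an audio embedding x and a class semantic embedding y, and predicts the unseen class with the highest score. The matrix `W`, the class names, and all numeric values below are hypothetical toy values set by hand; in the paper the projection is learned from seen classes, and that training step is not shown here.

```python
from typing import Dict, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def compatibility(audio_emb: Vector, W: Matrix, class_emb: Vector) -> float:
    """Bilinear score F(x, y) = x^T W y (assumed form, for illustration)."""
    return sum(
        audio_emb[i] * W[i][j] * class_emb[j]
        for i in range(len(audio_emb))
        for j in range(len(class_emb))
    )


def predict(audio_emb: Vector, W: Matrix, class_embs: Dict[str, Vector]) -> str:
    """Return the class whose semantic embedding scores highest."""
    return max(class_embs, key=lambda c: compatibility(audio_emb, W, class_embs[c]))


# Hypothetical toy setup: 3-dim audio embeddings, 2-dim semantic embeddings.
W = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]  # hand-set, not a learned projection
unseen_classes = {"dog_bark": [1.0, 0.0], "siren": [0.0, 1.0]}
clip = [0.2, 0.9, -0.4]

predicted = predict(clip, W, unseen_classes)
print(predicted)
```

Because the classifier only needs a semantic embedding per class, new classes can be added at test time by extending `unseen_classes`, without any audio training samples for them.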
Zero-Shot Audio Classification with Factored Linear and Nonlinear Acoustic-Semantic Projections
TLDR
Experimental results show that the proposed projection methods are effective for improving classification performance of zero-shot learning in audio classification.
Wikitag: Wikipedia-Based Knowledge Embeddings Towards Improved Acoustic Event Classification
TLDR
This paper describes how to extract label embeddings from multiple Wikipedia texts using a POS tagging based workflow, and forms the multi-view aligned AEC problem based on the VGGish model and AudioSet data.
AudioCLIP: Extending CLIP to Image, Text and Audio
TLDR
The proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset, which enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP’s ability to generalize to unseen datasets in a zero-shot inference fashion.
Zero-Shot Audio Classification using Image Embeddings
TLDR
It is demonstrated that image embeddings can be used as semantic information to perform zero-shot audio classification, that classification performance is highly sensitive to the semantic relation between test and training classes, and that textual and image embeddings can reach up to the performance of semantic acoustic embeddings when the seen and unseen classes are semantically similar.
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
TLDR
A (generalised) zero-shot learning benchmark is introduced on three audiovisual datasets of varying sizes and difficulty, VGGSound, UCF, and ActivityNet, ensuring that the unseen test classes do not appear in the dataset used for supervised training of the backbone deep models.
Guided Generative Adversarial Neural Network for Representation Learning and Audio Generation Using Fewer Labelled Audio Data
TLDR
This paper proposes a novel GAN-based model that is named Guided Generative Adversarial Neural Network (GGAN), which can learn powerful representations and generate good-quality samples using a small amount of labelled data as guidance.
GPLA-12: An Acoustic Signal Dataset of Gas Pipeline Leakage
TLDR
A new acoustic leakage dataset of gas pipelines, called GPLA-12, is introduced; it has 12 categories over 684 training/testing acoustic signals and is intended to serve as a feature learning dataset for time-series tasks and classification.

References

Zero-Shot Audio Classification Based On Class Label Embeddings
  • Huang Xie, T. Virtanen
  • Computer Science
    2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • 2019
TLDR
An audio classification system built on the bilinear model, which takes audio feature embeddings and semantic class labels as input, and measures the compatibility between an audio feature embedding and a class label embedding.
Zero-shot Learning for Audio-based Music Classification and Tagging
TLDR
This work investigates zero-shot learning in the music domain and organizes two different setups of side information: human-labeled attribute information based on the Free Music Archive and OpenMIC-2018 datasets, and general word semantic information from the Million Song Dataset and Last.fm tag annotations.
SoundSemantics: Exploiting Semantic Knowledge in Text for Embedded Acoustic Event Classification
  • Md Tamzeed Islam, S. Nirjon
  • Computer Science
    2019 18th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN)
  • 2019
TLDR
A generic mobile application for audio event detection that is able to recognize all types of sounds (at varying levels of accuracy depending on the number of classes that do not have any training examples), which is not achievable by any existing audio classifier.
Zero-Shot Learning via Semantic Similarity Embedding
In this paper we consider a version of the zero-shot learning problem where seen class source and target domain data are provided. The goal during test-time is to accurately predict the class label
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Synthesized Classifiers for Zero-Shot Learning
TLDR
This work introduces a set of "phantom" object classes whose coordinates live in both the semantic space and the model space and demonstrates superior accuracy of this approach over the state of the art on four benchmark datasets for zero-shot learning.
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
Label-Embedding for Image Classification
TLDR
This work proposes to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors, and introduces a function that measures the compatibility between an image and a label embedding.
Latent Embeddings for Zero-Shot Classification
TLDR
A novel latent embedding model for learning a compatibility function between image and class embeddings, in the context of zero-shot classification, that improves the state-of-the-art for various class embeddings consistently on three challenging publicly available datasets for the zero-shot setting.
Recent Advances in Zero-Shot Recognition: Toward Data-Efficient Understanding of Visual Content
TLDR
This article provides a comprehensive review of existing zero-shot recognition techniques covering various aspects ranging from representations of models, data sets, and evaluation settings and highlights the limitations of existing approaches.