Zero-Shot Audio Classification Based On Class Label Embeddings

@article{Xie2019ZeroShotAC,
  title={Zero-Shot Audio Classification Based On Class Label Embeddings},
  author={Huang Xie and Tuomas Virtanen},
  journal={2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year={2019},
  pages={264-267}
}
  • Huang Xie, T. Virtanen
  • Published 6 May 2019
  • Computer Science
  • 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
This paper proposes a zero-shot learning approach for audio classification based on the textual information about class labels without any audio samples from target classes. We propose an audio classification system built on the bilinear model, which takes audio feature embeddings and semantic class label embeddings as input, and measures the compatibility between an audio feature embedding and a class label embedding. We use VGGish to extract audio feature embeddings from audio recordings. We… 
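
The compatibility measure described in the abstract can be illustrated with a minimal sketch. It assumes the common bilinear form score = x^T W y, where x is an audio feature embedding, y is a class label embedding, and W is a matrix that would be learned on classes with training audio. The function names, dimensions, and toy values below are hypothetical and are not taken from the paper; real inputs would be VGGish audio embeddings and semantic label embeddings.

```python
from typing import Dict, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def compatibility(audio_emb: Vector, W: Matrix, label_emb: Vector) -> float:
    """Bilinear score audio_emb^T W label_emb (assumed form, see lead-in)."""
    # Project the label embedding into the audio embedding space: W @ label_emb.
    projected = [sum(w * y for w, y in zip(row, label_emb)) for row in W]
    # Dot product between the audio embedding and the projected label.
    return sum(x * p for x, p in zip(audio_emb, projected))


def predict(audio_emb: Vector, W: Matrix, label_embs: Dict[str, Vector]) -> str:
    """Return the class whose label embedding is most compatible with the audio."""
    return max(label_embs, key=lambda name: compatibility(audio_emb, W, label_embs[name]))


# Toy example: 2-dimensional audio embeddings, 3-dimensional label embeddings.
# W is hand-written here; in practice it would be trained on seen classes only.
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
unseen_labels = {
    "dog": [1.0, 0.0, 0.0],
    "rain": [0.0, 1.0, 0.0],
}
clip = [1.0, 0.0]
predicted = predict(clip, W, unseen_labels)
```

Because prediction only needs a label embedding for each candidate class, classes without any training audio can be scored at test time, which is what makes the setup zero-shot.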

Audio Captioning Based on Combined Audio and Semantic Embeddings
TLDR
Experimental results show that the proposed BiGRU-based deep model significantly outperforms state-of-the-art results across different evaluation metrics, and that including semantic information enhances captioning performance.
Audio Captioning with Composition of Acoustic and Semantic Information
TLDR
This work presents a novel encoder-decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings that outperforms state-of-the-art audio captioning models across different evaluation metrics, and shows that using the semantic information improves captioning performance.
AudioCLIP: Extending CLIP to Image, Text and Audio
TLDR
The proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset, which enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP’s ability to generalize to unseen datasets in a zero-shot inference fashion.
The Delayed Recognition of Scientific Novelty
TLDR
This work identifies novel papers by their high atypicality, which models how research draws upon unusual combinations of prior research in crafting its contributions, and evaluates the recognition of novel papers by citation and disruption, the latter capturing the degree to which a research article creates a new direction by eclipsing citations to the prior work it builds upon.
Zero-Shot Audio Classification Via Semantic Embeddings
  • Huang Xie, T. Virtanen
  • Computer Science
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2021
TLDR
The goal is to obtain a classifier that is capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information, and to demonstrate that both label embeddings and sentence embeddings are useful for zero-shot learning.
Zero-Shot Audio Classification with Factored Linear and Nonlinear Acoustic-Semantic Projections
TLDR
Experimental results show that the proposed projection methods are effective for improving classification performance of zero-shot learning in audio classification.
An Open-set Recognition and Few-Shot Learning Dataset for Audio Event Classification in Domestic Environments
TLDR
This paper is aimed at providing the audio recognition community with a carefully annotated dataset for FSL and OSR comprised of 1360 clips from 34 classes divided into pattern sounds and unwanted sounds.
Audio Captioning using Gated Recurrent Units
TLDR
A novel deep network architecture with audio embeddings is presented to predict audio captions, and the experimental results show that the proposed BiGRU-based deep model outperforms state-of-the-art results.
Rethinking CNN Models for Audio Classification
TLDR
It is shown that standard deep CNN models pretrained on ImageNet can be used as strong baseline networks for audio classification, and qualitative results are given on what the CNNs learn from the spectrograms by visualizing the gradients.

References

CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
Label-Embedding for Image Classification
TLDR
This work proposes to view attribute-based image classification as a label-embedding problem: each class is embedded in the space of attribute vectors, and introduces a function that measures the compatibility between an image and a label embedding.
Latent Embeddings for Zero-Shot Classification
TLDR
A novel latent embedding model for learning a compatibility function between image and class embeddings, in the context of zero-shot classification, that consistently improves the state of the art for various class embeddings on three challenging publicly available datasets for the zero-shot setting.
Audio Set: An ontology and human-labeled dataset for audio events
TLDR
The creation of Audio Set is described, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research and substantially stimulate the development of high-performance audio event recognizers.
Recent Advances in Zero-shot Recognition
TLDR
This article provides a comprehensive review of existing zero-shot recognition techniques, covering aspects ranging from model representations to datasets and evaluation settings, and highlights the limitations of existing approaches.
Large scale image annotation: learning to rank with joint word-image embeddings
TLDR
This work proposes a strongly performing method that scales to image annotation datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations.
An embarrassingly simple approach to zero-shot learning
TLDR
This paper describes a zero-shot learning approach that can be implemented in just one line of code, yet it is able to outperform state of the art approaches on standard datasets.
Zero-shot Learning with Semantic Output Codes
TLDR
A semantic output code classifier which utilizes a knowledge base of semantic properties of Y to extrapolate to novel classes and can often predict words that people are thinking about from functional magnetic resonance images of their neural activity, even without training examples for those words.
ESC: Dataset for Environmental Sound Classification
TLDR
A new annotated collection of 2000 short clips comprising 50 classes of various common sound events, and an abundant unified compilation of 250000 unlabeled auditory excerpts extracted from recordings available through the Freesound project are presented.
Distributed Representations of Words and Phrases and their Compositionality
TLDR
This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.