Zero-Shot Audio Classification with Factored Linear and Nonlinear Acoustic-Semantic Projections

@inproceedings{Xie2021ZeroShotAC,
  title={Zero-Shot Audio Classification with Factored Linear and Nonlinear Acoustic-Semantic Projections},
  author={Huang Xie and Okko Johannes R{\"a}s{\"a}nen and Tuomas Virtanen},
  booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={326-330}
}
  • Published 25 November 2020
  • Computer Science
In this paper, we study zero-shot learning in audio classification through factored linear and nonlinear acoustic-semantic projections between audio instances and sound classes. Zero-shot learning in audio classification refers to classification problems that aim at recognizing audio instances of sound classes, which have no available training data but only semantic side information. In this paper, we address zero-shot learning by employing factored linear and nonlinear acoustic-semantic… 
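The factored acoustic-semantic projection described in the abstract can be illustrated with a minimal sketch. This is a hypothetical NumPy implementation, not the authors' code: it assumes a rank-k factored bilinear compatibility F(x, e) = (Ux)ᵀ(Ve), where an audio embedding x and a class semantic embedding e are projected into a shared space and compared by dot product; the dimensions, initialization, and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: d_a = audio embedding, d_s = semantic embedding,
# k = rank of the factored projection.
d_a, d_s, k = 128, 300, 32

# Factored bilinear compatibility F(x, e) = (U x)^T (V e): both modalities
# are projected into a shared k-dimensional space. In practice U and V
# would be learned from seen classes; random values here are placeholders.
U = rng.standard_normal((k, d_a)) * 0.01  # acoustic projection
V = rng.standard_normal((k, d_s)) * 0.01  # semantic projection

def compatibility(x, E):
    """Score one audio embedding x against each row of the class-embedding
    matrix E; returns a vector of shape (n_classes,)."""
    return (U @ x) @ (V @ E.T)

def zero_shot_classify(x, E):
    """Predict the unseen class whose semantic embedding is most compatible
    with the audio embedding."""
    return int(np.argmax(compatibility(x, E)))

# Toy usage: 5 unseen classes, one audio clip embedding.
E_unseen = rng.standard_normal((5, d_s))
x = rng.standard_normal(d_a)
pred = zero_shot_classify(x, E_unseen)
```

A nonlinear variant, as the title suggests, would replace the linear maps `U @ x` and `V @ e` with small neural networks; the shared-space dot product stays the same.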


Zero-Shot Audio Classification using Image Embeddings
TLDR
It is demonstrated that image embeddings can be used as semantic information to perform zero-shot audio classification, that classification performance is highly sensitive to the semantic relation between test and training classes, and that textual and image embeddings can reach the performance of semantic acoustic embeddings when the seen and unseen classes are semantically similar.
Wikitag: Wikipedia-Based Knowledge Embeddings Towards Improved Acoustic Event Classification
TLDR
This paper describes how to extract label embeddings from multiple Wikipedia texts using a POS-tagging-based workflow, and formulates the multi-view aligned AEC problem based on the VGGish model and AudioSet data.
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
TLDR
A (generalised) zero-shot learning benchmark is introduced on three audiovisual datasets of varying sizes and difficulty, VGGSound, UCF, and ActivityNet, ensuring that the unseen test classes do not appear in the dataset used for supervised training of the backbone deep models.

References

SHOWING 1-10 OF 23 REFERENCES
Zero-Shot Audio Classification Via Semantic Embeddings
  • Huang Xie, T. Virtanen
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2021
TLDR
The goal is to obtain a classifier capable of recognizing audio instances of sound classes that have no available training samples, but only semantic side information; it is demonstrated that both label embeddings and sentence embeddings are useful for zero-shot learning.
Zero-Shot Audio Classification Based On Class Label Embeddings
  • Huang Xie, T. Virtanen
  • Computer Science
    2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • 2019
TLDR
An audio classification system built on the bilinear model, which takes audio feature embeddings and semantic class labels as input, and measures the compatibility between an audio feature embedding and a class label embedding.
Zero-shot Learning for Audio-based Music Classification and Tagging
TLDR
This work investigates zero-shot learning in the music domain and organizes two different setups of side information: human-labeled attribute information based on the Free Music Archive and OpenMIC-2018 datasets, and general word semantic information from the Million Song Dataset and this http URL tag annotations.
Latent Embeddings for Zero-Shot Classification
TLDR
A novel latent embedding model for learning a compatibility function between image and class embeddings, in the context of zero-shot classification, that consistently improves the state of the art for various class embeddings on three challenging publicly available datasets in the zero-shot setting.
CNN architectures for large-scale audio classification
TLDR
This work uses various CNN architectures to classify the soundtracks of a dataset of 70M training videos with 30,871 video-level labels, and investigates varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on the authors' audio classification task, and larger training and label sets help up to a point.
SoundSemantics: Exploiting Semantic Knowledge in Text for Embedded Acoustic Event Classification
  • Md Tamzeed Islam, S. Nirjon
  • Computer Science
    2019 18th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN)
  • 2019
TLDR
A generic mobile application for audio event detection that is able to recognize all types of sounds (at varying level of accuracy depending on the number of classes that do not have any training examples), which is not achievable by any existing audio classifier.
Few-Shot Acoustic Event Detection Via Meta Learning
TLDR
This paper formulates the few-shot AED problem and explores different ways of utilizing traditional supervised methods for this setting, as well as a variety of meta-learning approaches conventionally used to solve the few-shot classification problem.
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
TLDR
It is shown that the improved performance stems from the combination of a deep, high-capacity model and an augmented training set: this combination outperforms both the proposed CNN without augmentation and a “shallow” dictionary learning model with augmentation.
Learning to Match Transient Sound Events Using Attentional Similarity for Few-shot Sound Recognition
TLDR
The proposed attentional similarity module can be plugged into any metric-based learning method for few-shot learning, allowing the resulting model to especially match related short sound events.
Environmental sound classification with convolutional neural networks
  • Karol J. Piczak
  • Computer Science
    2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP)
  • 2015
TLDR
The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches.