Unsupervised Discriminative Learning of Sounds for Audio Event Classification

  • Sascha Hornauer, Ke Li, Stella X. Yu, Shabnam Ghaffarzadegan, Liu Ren
  • Published 19 May 2021
  • Computer Science, Engineering
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Recent progress in network-based audio event classification has shown the benefit of pre-training models on visual data such as ImageNet. While this process allows knowledge transfer across different domains, training a model on large-scale visual datasets is time-consuming. On several audio event classification benchmarks, we show a fast and effective alternative that pre-trains the model without supervision, on audio data alone, and yet delivers on-par performance with ImageNet pre-training…
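The abstract does not spell out the pre-training objective, but the paper builds on instance discrimination (cited below). As a hedged illustration only, not the authors' implementation, the core idea can be sketched as a contrastive loss that treats each audio clip as its own class: an encoder embedding of a clip should match the embedding of an augmented view of the same clip, against all other clips in the batch. All names and the numpy stand-ins here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def instance_discrimination_loss(anchors, positives, temperature=0.07):
    """Contrastive loss treating each clip as its own class: anchor i
    should be most similar to positive i (an augmented view of the
    same clip) among all positives in the batch."""
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature               # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the correct pairing sits on the diagonal
    return -np.mean(np.diag(log_probs))

# stand-ins for encoder outputs of two augmented views of 8 audio clips
anchors = rng.normal(size=(8, 128))
positives = anchors + 0.1 * rng.normal(size=(8, 128))  # mild "augmentation" noise
loss = instance_discrimination_loss(anchors, positives)
```

In practice the embeddings would come from a network applied to (augmented) log-mel spectrograms; the loss above is minimized when each clip's two views agree and all clips repel each other.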

References

Unsupervised Feature Learning for Audio Analysis
An unsupervised feature learning method for exploration of audio data with two novel contributions: an audio frame predictor based on a Convolutional LSTM autoencoder, and a training method for autoencoders that leads to distinct features by amplifying event similarities.
Unsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning…
Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging
A shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task, and a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) to generate new data-driven features from logarithmic Mel-filter bank features.
Audio Set: An ontology and human-labeled dataset for audio events
Describes the creation of Audio Set, a large-scale dataset of manually annotated audio events that endeavors to bridge the gap in data availability between image and audio research and to substantially stimulate the development of high-performance audio event recognizers.
Unsupervised Feature Learning via Non-parametric Instance Discrimination
This work formulates this intuition as a non-parametric classification problem at the instance level, and uses noise-contrastive estimation to tackle the computational challenges imposed by the large number of instance classes.
Unsupervised Learning of Spoken Language with Visual Context
A deep neural network model capable of rudimentary spoken language acquisition using untranscribed audio training data, whose only supervision comes in the form of contextually relevant visual images, is presented.
Momentum Contrast for Unsupervised Visual Representation Learning
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a…
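MoCo's two core mechanisms, a momentum-updated key encoder and a fixed-size queue of negative keys, can be sketched minimally as follows. This is a hedged toy illustration with made-up names and flat weight vectors, not the paper's actual implementation.

```python
import numpy as np
from collections import deque

momentum = 0.999
queue = deque(maxlen=4096)  # dictionary of past key embeddings

def momentum_update(query_weights, key_weights, m=momentum):
    """Key-encoder weights trail the query encoder as an
    exponential moving average, so keys change slowly and stay
    consistent with older entries in the queue."""
    return m * key_weights + (1.0 - m) * query_weights

# toy "encoder weights" as flat vectors
q_w = np.ones(4)
k_w = np.zeros(4)
for _ in range(3):
    k_w = momentum_update(q_w, k_w)

# enqueue the newest batch of key embeddings;
# once full, the oldest entries fall off automatically
for key in np.random.default_rng(0).normal(size=(8, 16)):
    queue.append(key)
```

The queue decouples the number of negatives from the batch size, which is the design point the abstract's truncated sentence is introducing.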
Look, Listen and Learn
There is a valuable, but so far untapped, source of information contained in the video itself – the correspondence between the visual and the audio streams – and a novel "Audio-Visual Correspondence" learning task that makes use of this.
Audio Word2Vec: Unsupervised Learning of Audio Segment Representations Using Sequence-to-Sequence Autoencoder
This paper proposes unsupervised learning of Audio Word2Vec from audio data without human annotation using a Sequence-to-Sequence Autoencoder (SA), which significantly outperformed conventional Dynamic Time Warping (DTW) based approaches at much lower computational cost.
Objects that Sound
New network architectures are designed that can be trained using the AVC task for two functionalities: cross-modal retrieval, and localizing the source of a sound in an image.