Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging

@article{xu_unsupervised_audio_tagging,
  title={Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging},
  author={Yong Xu and Qiang Huang and Wenwu Wang and Peter Foster and Siddharth Sigtia and Philip J. B. Jackson and Mark D. Plumbley},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}
}
Environmental audio tagging aims to predict only the presence or absence of certain acoustic events in the acoustic scene of interest. In this paper, we make contributions to audio tagging in two areas: acoustic modeling and feature learning. We propose a shrinking deep neural network (DNN) framework incorporating unsupervised feature learning to handle the multi-label classification task. For the acoustic modeling, a large set of contextual frames of the chunk are fed into…
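The multi-label setup described in the abstract can be illustrated with a minimal sketch (plain numpy, hypothetical sizes and random weights, not the authors' shrinking DNN): contextual frames are concatenated into one input vector, and a sigmoid output per tag gives an independent presence probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: 15 contextual frames of 40 mel-band features, 7 tags.
n_frames, n_bands, n_tags = 15, 40, 7
x = rng.standard_normal(n_frames * n_bands)   # concatenated context window

# One hidden layer with randomly initialised weights (illustration only).
W1 = rng.standard_normal((128, x.size)) * 0.01
W2 = rng.standard_normal((n_tags, 128)) * 0.01
h = np.maximum(0.0, W1 @ x)                   # ReLU hidden activations
p = sigmoid(W2 @ h)                           # per-tag presence probability

# Multi-label prediction: each tag is thresholded independently,
# unlike multi-class softmax where exactly one class is chosen.
tags_present = p >= 0.5
```

The key design point is the sigmoid (not softmax) output layer: several acoustic events can be present in the same chunk, so each tag is a separate binary decision.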
Attention and Localization Based on a Deep Convolutional Recurrent Model for Weakly Supervised Audio Tagging
A weakly supervised method that not only predicts the tags but also indicates the temporal locations of the acoustic events that occur; the attention scheme is found to be effective in identifying the important frames while ignoring unrelated frames.
Convolutional gated recurrent neural network incorporating spatial features for audio tagging
This paper proposes to use a convolutional neural network (CNN) to extract robust features from mel-filter banks, spectrograms, or even raw waveforms for audio tagging, and evaluates the proposed methods on Task 4 of the Detection and Classification of Acoustic Scenes and Events 2016 (DCASE 2016) challenge.
Meta learning based audio tagging
This paper describes a solution for the general-purpose audio tagging task, one of the subtasks of the DCASE 2018 challenge, and proposes a meta-learning-based ensemble method that provides higher prediction accuracy and robustness compared to a single model.
Bag-of-Deep-Features: Noise-Robust Deep Feature Representations for Audio Analysis
This paper quantises deep feature representations of various in-the-wild audio data sets, comparing the efficacy of feature spaces extracted from different fully connected deep neural networks to classify six real-world audio corpora, and shows the suitability of quantising the deep representations for noisy in-the-wild audio data.
Unsupervised Learning of Semantic Audio Representations
  • A. Jansen, M. Plakal, R. Saurous
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.
Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won 1st place in the large-scale weakly supervised sound event detection task of the DCASE 2017 challenge.
Learning Audio Sequence Representations for Acoustic Event Classification
Semi-supervised Triplet Loss Based Learning of Ambient Audio Embeddings
This paper combines unsupervised and supervised triplet-loss-based learning into a semi-supervised representation learning approach, whereby the positive samples for those triplets whose anchors are unlabeled are obtained either by applying a transformation to the anchor or by selecting the nearest sample in the training set.
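The triplet loss underlying this family of methods can be sketched in a few lines (plain numpy, hypothetical embeddings): the anchor is pulled toward the positive and pushed away from the negative by at least a margin.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embedding vectors (squared Euclidean)."""
    d_pos = np.sum((anchor - positive) ** 2)   # distance anchor -> positive
    d_neg = np.sum((anchor - negative) ** 2)   # distance anchor -> negative
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])    # embedding close to the anchor
n = np.array([-1.0, 0.5])   # embedding far from the anchor
loss_easy = triplet_loss(a, p, n)   # negative already far: loss collapses to 0
loss_hard = triplet_loss(a, n, p)   # roles swapped: margin violated, loss > 0
```

In the semi-supervised setting described above, the positive for an unlabeled anchor would be a transformed copy of the anchor or its nearest neighbour, while labeled triplets use same-class samples as positives.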
A Fusion of Deep Convolutional Generative Adversarial Networks and Sequence to Sequence Autoencoders for Acoustic Scene Classification
A novel combination of features learnt using both a deep convolutional generative adversarial network (DCGAN) and a recurrent sequence-to-sequence autoencoder (S2SAE) is proposed to generate robust features for acoustic scene analysis.
A Comparison of Attention Mechanisms of Convolutional Neural Network in Weakly Labeled Audio Tagging
The results show that attention based on the gated linear unit (GLU) performs better than the squeeze-and-excitation (SE) block in a CRNN for weakly labeled polyphonic audio tagging.


End-to-end learning for music audio
  • S. Dieleman, B. Schrauwen
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
Although convolutional neural networks do not outperform a spectrogram-based approach, the networks are able to autonomously discover frequency decompositions from raw audio, as well as phase- and translation-invariant feature representations.
Audio Event Detection using Weakly Labeled Data
It is shown that audio event detection using weak labels can be formulated as a multiple instance learning (MIL) problem, and two frameworks for solving it are suggested, one based on support vector machines and the other on neural networks.
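The multiple-instance view can be sketched simply (plain numpy, hypothetical scores): a recording is a bag of frame-level instances, only the bag (clip) label is known, and the clip-level prediction is pooled from the instances, e.g. by taking the maximum score per class.

```python
import numpy as np

def bag_prediction(instance_scores):
    """Max-pool frame-level scores into one clip-level score per event class."""
    return np.max(instance_scores, axis=0)

# Hypothetical clip: 5 frames x 3 event classes, scores in [0, 1].
scores = np.array([
    [0.1, 0.0, 0.2],
    [0.2, 0.1, 0.9],   # class 2 fires strongly in this frame
    [0.1, 0.0, 0.8],
    [0.0, 0.1, 0.1],
    [0.1, 0.0, 0.0],
])
clip_scores = bag_prediction(scores)   # one score per class
clip_labels = clip_scores >= 0.5       # weak (clip-level) tag decision
```

Max pooling encodes the MIL assumption directly: a bag is positive if at least one of its instances is positive, so only the strongest frame determines the clip label.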
Spectral vs. spectro-temporal features for acoustic event detection
  • Courtenay V. Cotton, D. Ellis
  • Computer Science, Physics
    2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
  • 2011
This work proposes an approach to detecting and modeling acoustic events that directly describes temporal context, using convolutive non-negative matrix factorization (NMF), and discovers a set of spectro-temporal patch bases that best describe the data.
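As a simplified illustration of the factorization involved (plain NMF rather than the convolutive variant, on random data): the classical Lee-Seung multiplicative updates factor a nonnegative spectrogram-like matrix V ≈ WH while keeping both factors nonnegative, and the reconstruction error does not increase.

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((20, 30))       # nonnegative "spectrogram" (freq x time)
k, eps = 4, 1e-9               # number of basis vectors; numerical floor
W = rng.random((20, k))        # spectral bases
H = rng.random((k, 30))        # time-varying activations

err_before = np.linalg.norm(V - W @ H)
for _ in range(100):
    # Lee-Seung multiplicative updates for the Euclidean objective:
    # elementwise scaling keeps W and H nonnegative by construction.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
err_after = np.linalg.norm(V - W @ H)
```

The convolutive extension used in the paper replaces each basis column with a spectro-temporal patch spanning several frames, which is what lets the decomposition describe temporal context directly.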
Learning Features from Music Audio with Deep Belief Networks
This work presents a system that can automatically extract relevant features from audio for a given task by using a Deep Belief Network on Discrete Fourier Transforms of the audio to solve the task of genre recognition.
Multiscale Approaches To Music Audio Feature Learning
Three approaches to multiscale audio feature learning using the spherical K-means algorithm are developed, compared, and evaluated on an automatic tagging task and a similarity metric learning task on the Magnatagatune dataset.
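A minimal sketch of the spherical K-means step (plain numpy, random data, hypothetical sizes): both data points and centroids live on the unit sphere, assignment uses the dot product (cosine similarity), and centroids are re-normalised after each mean update.

```python
import numpy as np

rng = np.random.default_rng(2)

def unit(x, axis=-1, eps=1e-9):
    """Project rows (or a single vector) onto the unit sphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

X = unit(rng.standard_normal((200, 16)))   # 200 L2-normalised feature patches
C = unit(rng.standard_normal((8, 16)))     # 8 centroids on the unit sphere

for _ in range(10):
    assign = np.argmax(X @ C.T, axis=1)         # cosine-similarity assignment
    for j in range(C.shape[0]):
        members = X[assign == j]
        if len(members) > 0:
            C[j] = unit(members.sum(axis=0))    # mean direction, re-normalised

codes = X @ C.T   # the learned dictionary can then encode new features
```

The multiscale variants in the paper would run this dictionary learning on inputs at several time scales; the sketch shows only the single-scale core.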
CQT-based Convolutional Neural Networks for Audio Scene Classification
It is shown in this paper that a constant-Q-transformed input to a convolutional neural network improves results, and a parallel (graph-based) neural network architecture is proposed which captures relevant audio characteristics both in time and in frequency.
Unsupervised content discovery in composite audio
Evaluations of the proposed unsupervised approach to discover and categorize semantic content in a composite audio stream indicate that promising results can be achieved, both regarding audio element discovery and auditory scene categorization.
An investigation of deep neural networks for noise robust speech recognition
The noise robustness of DNN-based acoustic models can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation and can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training.
An Analysis of Single-Layer Networks in Unsupervised Feature Learning
The results show that large numbers of hidden nodes and dense feature extraction are critical to achieving high performance—so critical, in fact, that when these parameters are pushed to their limits, they achieve state-of-the-art performance on both CIFAR-10 and NORB using only a single layer of features.
A Regression Approach to Speech Enhancement Based on Deep Neural Networks
The proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general, and is effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods.