Unsupervised Learning of Semantic Audio Representations

@article{Jansen2018UnsupervisedLO,
  title={Unsupervised Learning of Semantic Audio Representations},
  author={Aren Jansen and Manoj Plakal and Ratheet Pandya and Daniel P. W. Ellis and Shawn Hershey and Jiayang Liu and R. Channing Moore and Rif A. Saurous},
  journal={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018},
  pages={126-130}
}
  • Published 6 November 2017
Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related.
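
These constraints suggest a simple recipe for mining (anchor, positive, negative) training triplets from unlabeled audio. The NumPy sketch below illustrates one way that recipe could look; make_triplets, clips, and the noise and mixing coefficients are illustrative assumptions, not the paper's implementation.

import numpy as np

def make_triplets(clips, rng=np.random.default_rng(0)):
    """Form (anchor, positive, negative) triplets from unlabeled audio.

    `clips` is assumed to be an array of shape (num_clips, num_samples)
    holding consecutive fixed-length clips from a recording; all names
    here are illustrative, not the paper's implementation.
    """
    n = len(clips)
    triplets = []
    for i, anchor in enumerate(clips):
        choice = rng.integers(3)
        if choice == 0:
            # (i) Noise and time translation preserve the category.
            shifted = np.roll(anchor, rng.integers(1, anchor.size))
            positive = shifted + 0.1 * rng.standard_normal(anchor.size)
        elif choice == 1:
            # (ii) A mixture inherits the categories of its constituents.
            positive = 0.5 * (anchor + clips[rng.integers(n)])
        else:
            # (iii) Temporally adjacent events likely share a category.
            positive = clips[(i + 1) % n]
        # Negative: a random other clip; with many sound categories it is
        # unlikely to share the anchor's (unknown) category.
        negative = clips[rng.integers(n)]
        triplets.append((anchor, positive, negative))
    return triplets

def triplet_loss(f_a, f_p, f_n, margin=0.1):
    """Standard hinge triplet loss on embedding vectors f(.)."""
    d_pos = np.sum((f_a - f_p) ** 2)  # squared distance to positive
    d_neg = np.sum((f_a - f_n) ** 2)  # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)

Each constraint yields positives for free, while a random other clip serves as the negative: with a large vocabulary of sound categories, a random negative rarely collides with the anchor's true category.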

Citations

Semi-supervised Triplet Loss Based Learning of Ambient Audio Embeddings
TLDR
This paper combines unsupervised and supervised triplet loss based learning into a semi-supervised representation learning approach, whereby the positive samples for those triplets whose anchors are unlabeled are obtained either by applying a transformation to the anchor, or by selecting the nearest sample in the training set.
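
As a rough illustration of that positive-selection rule, the sketch below picks a positive for each anchor under the stated policy; choose_positive, X, y, and the -1 convention for unlabeled samples are hypothetical names for this example, not the paper's code.

import numpy as np

def choose_positive(i, X, y, rng=np.random.default_rng(0)):
    """Pick the positive for anchor X[i] in a semi-supervised triplet.

    Assumed conventions for this sketch: X holds one feature vector per
    training sample, y[i] >= 0 is a class label, and y[i] == -1 marks an
    unlabeled sample.
    """
    if y[i] >= 0:
        # Labeled anchor: draw another sample of the same class
        # (assumes at least one other sample of that class exists).
        same = np.flatnonzero((y == y[i]) & (np.arange(len(y)) != i))
        return X[rng.choice(same)]
    if rng.random() < 0.5:
        # Unlabeled anchor, option (a): a transformed copy of the anchor.
        return X[i] + 0.05 * rng.standard_normal(X[i].shape)
    # Unlabeled anchor, option (b): the nearest sample in the training set.
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf  # exclude the anchor itself
    return X[np.argmin(d)]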
Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
TLDR
The results show that employing multi-head self-attention with multiple heads in the tag-based network can induce better learned audio representations.
Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings
TLDR
This paper investigates how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings, and shows that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key.
Wikitag: Wikipedia-Based Knowledge Embeddings Towards Improved Acoustic Event Classification
TLDR
To the author’s knowledge, this is the first work in the AEC domain on building large-scale label representations by leveraging Wikipedia data in a systematic fashion.
A Deep Residual Network for Large-Scale Acoustic Scene Analysis
TLDR
The task of training a multi-label event classifier directly from the audio recordings of AudioSet is studied and it is found that the models are able to localize audio events when a finer time resolution is needed.
Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision
  • A. Jansen, D. Ellis, R. Saurous
  • Computer Science
  • ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
TLDR
A learning framework for sound representation and recognition is presented that combines a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, a clustering objective that reflects the need to impose categorical structure on experience, and a cluster-based active learning procedure that solicits targeted weak supervision to consolidate categories into relevant semantic classes.
Grounding Spoken Words in Unlabeled Video
TLDR
Deep learning models that learn joint multi-modal embeddings in videos where the audio and visual streams are loosely synchronized are explored; with weak supervision, the authors see significant amounts of cross-modal learning.
Improving Universal Sound Separation Using Sound Classification
TLDR
This paper shows that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information, and establishes a new state-of-the-art for universal sound separation.
Self-Supervised Learning from Automatically Separated Sound Scenes
TLDR
This paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning and finds that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone.
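
One way to picture that objective is an InfoNCE-style contrastive loss whose positive pairs couple each mixture with its separated channels. The sketch below is a minimal NumPy rendering under assumed shapes; nt_xent, z_mix, and z_sep are illustrative names, not the paper's implementation.

import numpy as np

def nt_xent(z_mix, z_sep, tau=0.1):
    """NT-Xent-style contrastive loss pairing mixture embeddings with
    embeddings of their automatically separated channels.

    Assumed shapes for this sketch: z_mix is (batch, dim) and z_sep is
    (batch, dim), with z_sep[i] the embedding of a channel separated
    from mixture i. Row i treats column i as the positive pair and all
    other columns as in-batch negatives.
    """
    # L2-normalize so dot products are cosine similarities.
    z_mix = z_mix / np.linalg.norm(z_mix, axis=1, keepdims=True)
    z_sep = z_sep / np.linalg.norm(z_sep, axis=1, keepdims=True)
    sim = z_mix @ z_sep.T / tau  # (batch, batch) similarity matrix
    # Softmax cross-entropy against the diagonal (true pairs).
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))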
Unsupervised Learning of Deep Features for Music Segmentation
  • Matthew C. McCallum
  • Computer Science
  • ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2019
TLDR
Unsupervised training of deep feature embeddings using convolutional neural networks (CNNs) is explored for music segmentation and is shown not only to significantly improve the performance of this algorithm, but also to obtain state-of-the-art performance in unsupervised music segmentation.

References

Showing 1-10 of 31 references
Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging
TLDR
A shrinking deep neural network (DNN) framework incorporating unsupervised feature learning is proposed to handle the multi-label classification task, along with a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) that generates new data-driven features from the logarithmic Mel-filter bank features.
Unsupervised Visual Representation Learning by Context Prediction
TLDR
It is demonstrated that the feature representation learned using this within-image context indeed captures visual similarity across images and allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset.
Joint Learning of Speaker and Phonetic Similarities with Siamese Networks
TLDR
It is found that the joint embedding architectures succeed in effectively disentangling speaker from phoneme information, with around 10% error on the matching tasks, and that the results carry over to out-of-domain datasets, even beating the best results obtained with similar weakly supervised techniques.
SoundNet: Learning Sound Representations from Unlabeled Video
TLDR
This work proposes a student-teacher training procedure which transfers discriminative visual knowledge from well-established visual recognition models into the sound modality using unlabeled video as a bridge, and suggests that some high-level semantics automatically emerge in the sound network, even though it is trained without ground-truth labels.
Unsupervised neural network based feature extraction using weak top-down constraints
TLDR
A novel unsupervised DNN-based feature extractor that can be trained without these resources in zero-resource settings is proposed, and pairs of isolated word examples of the same unknown type are found to provide weak top-down supervision.
Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection
TLDR
This work introduces a convolutional neural network (CNN) with a large input field for AED that significantly outperforms state-of-the-art methods, including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16% absolute improvement.
Visually Grounded Learning of Keyword Prediction from Untranscribed Speech
TLDR
This work uses an image-to-words multi-label visual classifier to tag images with soft textual labels, and then trains a neural network to map from the speech to these soft targets, and shows that the resulting speech system is able to predict which words occur in an utterance without seeing any parallel speech and text.
Phonetics embedding learning with side information
TLDR
It is shown that it is possible to learn an efficient acoustic model using only a small amount of easily available word-level similarity annotations, and the resulting model is shown to perform much better than raw speech features in an ABX minimal-pair discrimination task.
Context Encoders: Feature Learning by Inpainting
TLDR
It is found that a context encoder learns a representation that captures not just appearance but also the semantics of visual structures, and can be used for semantic inpainting tasks, either stand-alone or as initialization for non-parametric methods.
Weak top-down constraints for unsupervised acoustic model training
TLDR
A much weaker form of top-down supervision for use in place of transcripts and dictionaries in the zero resource setting is investigated, capable of improving model speaker independence by up to 57% relative over bottom-up training alone.