Semi-supervised Triplet Loss Based Learning of Ambient Audio Embeddings

@inproceedings{Turpault2019SemisupervisedTL,
  title={Semi-supervised Triplet Loss Based Learning of Ambient Audio Embeddings},
  author={Nicolas Turpault and Romain Serizel and Emmanuel Vincent},
  booktitle={ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2019},
  pages={760--764}
}
Deep neural networks are particularly useful for learning relevant representations from data. Recent studies have demonstrated the potential of unsupervised representation learning for ambient sound analysis using various flavors of the triplet loss, and have compared this approach to supervised learning. However, in real situations, it is common to have a small labeled dataset and a large unlabeled one. In this paper, we combine unsupervised and supervised triplet loss based learning into a semi…
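For readers unfamiliar with the technique, the triplet loss the paper builds on can be sketched as follows. This is a minimal NumPy illustration; `sample_triplets` is a hypothetical toy sampler for the labeled portion of the data, not the authors' actual semi-supervised procedure:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin),
    with d the squared Euclidean distance between embeddings."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

def sample_triplets(labels, rng):
    """Toy supervised sampler: the positive shares the anchor's label,
    the negative has a different known label. Unlabeled items (label None)
    are skipped here; an unsupervised flavor would instead draw positives
    from the anchor's temporal neighborhood in the same recording."""
    triplets = []
    for i, y in enumerate(labels):
        if y is None:
            continue
        pos = [j for j, yj in enumerate(labels) if yj == y and j != i]
        neg = [j for j, yj in enumerate(labels) if yj is not None and yj != y]
        if pos and neg:
            triplets.append((i, rng.choice(pos), rng.choice(neg)))
    return triplets
```

A semi-supervised scheme in this spirit would mix triplets from both samplers in each training batch, so the labeled data shapes class structure while the unlabeled data regularizes the embedding space.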

Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags
TLDR
The results show that employing multi-head self-attention in the tag-based network can induce better learned audio representations.
Deep Ranking-Based Sound Source Localization
TLDR
A novel weakly-supervised deep-learning localization method that exploits only a few labeled (anchor) samples with known positions, together with a larger set of unlabeled samples, for which the authors only know their relative physical ordering.
Guided Learning for the combination of weakly-supervised and semi-supervised learning
TLDR
This work presents an end-to-end semi-supervised learning process, termed Guided Learning, in which two different models improve training efficiency; the approach outperforms the first-place result on DCASE 2018 Task 4, which employed Mean Teacher with a well-designed CRNN network.
Tricycle: Audio Representation Learning from Sensor Network Data Using Self-Supervision
TLDR
A model for learning audio representations by predicting the long-term, cyclic temporal structure in audio data collected from an urban acoustic sensor network is presented and the utility of the learned audio representation in an urban sound event detection task with limited labeled data is demonstrated.
Unsupervised Scalable Representation Learning for Multivariate Time Series
TLDR
This paper combines an encoder based on causal dilated convolutions with a novel triplet loss employing time-based negative sampling, obtaining general-purpose representations for variable-length and multivariate time series.
Metric Learning with Background Noise Class for Few-Shot Detection of Rare Sound Events
TLDR
This paper aims to achieve few-shot detection of rare sound events from a query sequence that contains not only the target events but also other events and background noise, and proposes metric learning with a background-noise class for few-shot detection.
A Monte Carlo Search-Based Triplet Sampling Method for Learning Disentangled Representation of Impulsive Noise on Steering Gear
TLDR
This paper proposes a method to overcome the above two major hurdles by modifying a triplet-pair sampling algorithm to use the structural similarity index instead of naive Euclidean distance within a Monte Carlo based sampling strategy.
At the Speed of Sound: Efficient Audio Scene Classification
TLDR
This work proposes a retrieval-based scene classification architecture that combines recurrent neural networks and attention to compute embeddings for short audio segments that can discriminate audio scenes with high accuracy after listening in for less than a second.
Model for Practice Badminton Basic Skills by using Motion Posture Detection from Video Posture Embedding and One-Shot Learning Technique
TLDR
A model for practicing badminton basic skills is proposed: the video posture embedding is created using the triplet-loss technique, and the badminton player's motion posture detection is developed using the one-shot learning technique.
Enhancing Audio Augmentation Methods with Consistency Learning
TLDR
It is shown empirically that certain measures of consistency are not implicitly captured by the cross-entropy loss, and that incorporating such measures into the loss function can improve the performance of tasks such as audio tagging.

References

SHOWING 1-10 OF 29 REFERENCES
Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging
TLDR
A shrinking deep neural network (DNN) framework incorporating unsupervised feature learning is proposed to handle the multilabel classification task, along with a symmetric or asymmetric deep denoising auto-encoder (syDAE or asyDAE) to generate new data-driven features from log Mel filterbank features.
Unsupervised Learning of Semantic Audio Representations
  • A. Jansen, M. Plakal, R. Saurous
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
This work considers several class-agnostic semantic constraints that apply to unlabeled nonspeech audio and proposes low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively.
Realistic Evaluation of Deep Semi-Supervised Learning Algorithms
TLDR
This work creates a unified reimplementation and evaluation platform for various widely used SSL techniques and finds that the performance of simple baselines which do not use unlabeled data is often underreported, that SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and that performance can degrade substantially when the unlabeled dataset contains out-of-class examples.
Acoustic classification using semi-supervised Deep Neural Networks and stochastic entropy-regularization over nearest-neighbor graphs
  • S. Thulasidasan, J. Bilmes
  • Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
TLDR
Results indicate that the graph-based semi-supervised learning method for acoustic data significantly improves classification accuracy compared to the fully supervised case when the fraction of labeled data is low, and it is competitive with other methods in the fully labeled case.
Unsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning…
Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network
In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won the 1st place in the large-scale weakly…
Training general-purpose audio tagging networks with noisy labels and iterative self-verification
This paper describes our submission to the first Freesound general-purpose audio tagging challenge carried out within the DCASE 2018 challenge. Our proposal is based on a fully convolutional neural…
Deep ranking: Triplet MatchNet for music metric learning
TLDR
A deep neural network named Triplet MatchNet is proposed to learn metrics directly from raw audio signals of triplets of music excerpts with human-annotated relative similarity in a supervised fashion and significantly outperforms three state-of-the-art music metric learning methods.
MEAN TEACHER CONVOLUTION SYSTEM FOR DCASE 2018 TASK 4
TLDR
A mean-teacher model with context-gating convolutional neural network (CNN) and recurrent neural network (RNN) is proposed to maximize the use of the unlabeled in-domain dataset.
Simple Triplet Loss Based on Intra/Inter-Class Metric Learning for Face Verification
TLDR
Experimental evaluations on the most widely used benchmarks LFW and YTF show that the model with the proposed class-wise simple triplet loss can reach the state-of-the-art performance.