CDPAM: Contrastive Learning for Perceptual Audio Similarity

@article{Manocha2021CDPAMCL,
  title={CDPAM: Contrastive Learning for Perceptual Audio Similarity},
  author={Pranay Manocha and Zeyu Jin and Richard Zhang and Adam Finkelstein},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={196-200}
}
  • Pranay Manocha, Zeyu Jin, Richard Zhang, Adam Finkelstein
  • Published 9 February 2021
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Many speech processing methods based on deep learning require an automatic and differentiable audio metric for the loss function. The DPAM approach of Manocha et al. [1] learns a full-reference metric trained directly on human judgments, and thus correlates well with human perception. However, it requires a large number of human annotations and does not generalize well outside the range of perturbations on which it was trained. This paper introduces CDPAM – a metric that builds on and advances…
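
In practice, a learned full-reference metric of this kind is used as a differentiable term in a training loss alongside a conventional reconstruction loss. The sketch below shows only that usage pattern; the encoder, the weighting, and the PerceptualMetric name are illustrative placeholders, not the released CDPAM implementation.

# Sketch: using a learned full-reference perceptual metric as a training loss.
# `PerceptualMetric` is a stand-in for a pretrained model such as CDPAM; the
# architecture and loss weighting below are illustrative only.
import torch
import torch.nn as nn

class PerceptualMetric(nn.Module):
    """Distance = L1 gap between deep embeddings of reference and output."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(          # placeholder for a pretrained encoder
            nn.Conv1d(1, 32, kernel_size=15, stride=4, padding=7), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=15, stride=4, padding=7), nn.ReLU(),
        )
        for p in self.parameters():            # the metric stays frozen during training
            p.requires_grad_(False)

    def forward(self, ref: torch.Tensor, out: torch.Tensor) -> torch.Tensor:
        return (self.encoder(ref) - self.encoder(out)).abs().mean()

metric = PerceptualMetric().eval()
ref = torch.randn(8, 1, 16000)                       # batch of reference waveforms
est = torch.randn(8, 1, 16000, requires_grad=True)   # model output (dummy here)

# Combined loss: waveform L1 plus the learned perceptual distance.
loss = nn.functional.l1_loss(est, ref) + 0.5 * metric(ref, est)
loss.backward()                                      # gradients flow through the metric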


DPLM: A Deep Perceptual Spatial-Audio Localization Metric

This work proposes a framework for building a general-purpose quality metric to assess spatial localization differences between two binaural recordings, and models localization similarity using activation-level distances from deep networks trained for direction-of-arrival (DOA) estimation.
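
The activation-level-distance idea can be illustrated with a small sketch: run both binaural recordings through a frozen network (here a toy stand-in for a DOA model) and average the L1 gaps between intermediate activations. The layer choice and equal weighting are assumptions for illustration.

# Sketch: localization similarity as an activation-level distance through a
# frozen network. The two-channel toy encoder stands in for a pretrained DOA model.
import torch
import torch.nn as nn

layers = nn.ModuleList([
    nn.Sequential(nn.Conv1d(2, 16, 31, stride=4, padding=15), nn.ReLU()),
    nn.Sequential(nn.Conv1d(16, 32, 31, stride=4, padding=15), nn.ReLU()),
])
for p in layers.parameters():
    p.requires_grad_(False)

def activation_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Mean L1 distance between per-layer activations of two binaural clips."""
    dist = 0.0
    for layer in layers:
        a, b = layer(a), layer(b)
        dist = dist + (a - b).abs().mean()
    return dist / len(layers)

x = torch.randn(1, 2, 48000)   # binaural recording A (left/right channels)
y = torch.randn(1, 2, 48000)   # binaural recording B
print(activation_distance(x, y).item())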

SQAPP: No-Reference Speech Quality Assessment Via Pairwise Preference

A learning framework is proposed for estimating the quality of a recording without any reference and without any human judgments; it relies on a pairwise quality-preference strategy that reduces label noise, thereby making learning more robust.
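
A pairwise quality-preference objective can be sketched as follows: a scorer assigns a scalar quality to each recording, and a cross-entropy on the score difference pushes the preferred recording to score higher. The toy scorer architecture and the margin-free Bradley-Terry formulation are assumptions, not the paper's exact model.

# Sketch: pairwise preference training for a no-reference quality scorer.
# Labels say only which of the two recordings was preferred, not its MOS.
import torch
import torch.nn as nn

scorer = nn.Sequential(                        # toy no-reference quality scorer
    nn.Conv1d(1, 16, 63, stride=8, padding=31), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 1),
)
opt = torch.optim.Adam(scorer.parameters(), lr=1e-4)

a = torch.randn(8, 1, 16000)                   # recording A in each pair
b = torch.randn(8, 1, 16000)                   # recording B in each pair
prefer_a = torch.randint(0, 2, (8, 1)).float() # 1 if listeners preferred A

# Bradley-Terry style loss: the sigmoid of the score gap predicts the preference.
logits = scorer(a) - scorer(b)
loss = nn.functional.binary_cross_entropy_with_logits(logits, prefer_a)
loss.backward()
opt.step()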

NORESQA - A Framework for Speech Quality Assessment using Non-Matching References

This work proposes a novel framework that predicts a subjective relative quality score for a given speech signal with respect to any provided reference, without using any subjective data, and shows that neural networks trained with this framework produce scores that correlate well with subjective mean opinion scores (MOS) and are competitive with methods such as DNSMOS.

Style Transfer of Audio Effects with Differentiable Signal Processing

This work presents a framework that can impose the audio effects and production style of one recording onto another by example, producing convincing production style transfer results with the ability to transform input recordings into produced recordings, and yielding audio-effect control parameters that enable interpretability and user interaction.
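
"Differentiable signal processing" means the effect parameters themselves receive gradients. The sketch below fits a single differentiable gain so a processed input matches a reference's energy; real systems chain differentiable EQ, compression, and reverb in the same way. The single-gain effect is a deliberate simplification, not the paper's processing chain.

# Sketch: differentiable signal processing in miniature. A gain parameter is
# optimized by gradient descent so the processed input matches the reference's
# RMS energy; production systems do this for full effect chains.
import torch

x = torch.randn(1, 16000) * 0.1                 # quiet input recording
ref = torch.randn(1, 16000) * 0.5               # louder reference recording
log_gain = torch.zeros(1, requires_grad=True)   # effect parameter to learn

opt = torch.optim.Adam([log_gain], lr=0.05)
for _ in range(200):
    y = x * log_gain.exp()                      # differentiable "effect": a simple gain
    loss = (y.pow(2).mean() - ref.pow(2).mean()).abs()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"learned gain ~ {log_gain.exp().item():.2f}")   # close to 5 in expectation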

Audio Similarity is Unreliable as a Proxy for Audio Quality

It is concluded that similarity serves as an unreliable proxy for audio quality because similarity scores vary with the choice of clean reference, rely on attributes that humans factor out when considering quality, and are sensitive to imperceptible signal-level differences.

On the Compensation Between Magnitude and Phase in Speech Separation

A novel view is provided on the implicit compensation between estimated magnitude and phase that arises from deep-neural-network-based end-to-end optimization in the complex time-frequency (T-F) domain or the time domain.
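
The magnitude/phase decomposition analyzed here can be made concrete: a complex STFT coefficient factors as a magnitude times a unit phasor, and a time-domain loss implicitly trades errors between the two factors. A minimal decomposition/reconstruction sketch using torch.stft follows; the signal and STFT settings are arbitrary.

# Sketch: magnitude/phase factorization of the complex T-F representation.
# X(t, f) = |X(t, f)| * exp(i * angle(X(t, f))); end-to-end time-domain losses
# let a network trade magnitude error against phase error.
import torch

x = torch.randn(1, 16000)                                 # a mono waveform
window = torch.hann_window(512)
X = torch.stft(x, n_fft=512, hop_length=128, window=window, return_complex=True)

mag, phase = X.abs(), X.angle()                           # the two factors
X_rebuilt = torch.polar(mag, phase)                       # |X| * exp(i * phase)
x_rebuilt = torch.istft(X_rebuilt, n_fft=512, hop_length=128, window=window,
                        length=x.shape[-1])

print(torch.allclose(x, x_rebuilt, atol=1e-5))            # near-perfect round trip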

Efficient Speech Quality Assessment using Self-supervised Framewise Embeddings

This paper proposes an efficient system with results comparable to the best performing model of the ConferencingSpeech 2022 challenge, characterized by a smaller number of parameters, fewer FLOPs, lower memory consumption, and lower latency, and thereby contributes to sustainable machine learning.

Speech Quality Assessment through MOS using Non-Matching References

This work presents a novel framework, NORESQA-MOS, for estimating the MOS of a speech signal, which provides better generalization and more robust MOS estimation than previous state-of-the-art methods such as DNSMOS and NISQA, even though it uses a smaller training set.

Impairment Representation Learning for Speech Quality Assessment

An impairment representation learning approach is proposed to pre-train the network on a large amount of simulated data without MOS annotation, then fine-tune the pre-trained model for the MOS prediction task on annotated data.
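
The two-stage recipe can be sketched as pretraining an encoder to classify simulated impairment types (no MOS labels needed), then reusing that encoder with a small regression head for MOS. The toy encoder, impairment classes, and label setup below are illustrative assumptions.

# Sketch: impairment-representation pretraining followed by MOS fine-tuning.
import torch
import torch.nn as nn

encoder = nn.Sequential(                          # shared speech encoder (toy)
    nn.Conv1d(1, 32, 63, stride=8, padding=31), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
)

# Stage 1: pretrain on simulated data labeled by impairment type (no MOS needed).
impair_head = nn.Linear(32, 5)                    # e.g. noise/clipping/codec/reverb/clean
x_sim = torch.randn(16, 1, 16000)
y_sim = torch.randint(0, 5, (16,))
loss_pre = nn.functional.cross_entropy(impair_head(encoder(x_sim)), y_sim)
loss_pre.backward()                               # updates encoder + impairment head

# Stage 2: fine-tune a fresh regression head (and the encoder) on annotated MOS.
mos_head = nn.Linear(32, 1)
x_mos = torch.randn(4, 1, 16000)
y_mos = torch.rand(4, 1) * 4 + 1                  # MOS labels in [1, 5]
loss_ft = nn.functional.mse_loss(mos_head(encoder(x_mos)), y_mos)
loss_ft.backward()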

A Prototypical Network Approach for Evaluating Generated Emotional Speech

This work explores the use of a prototypical network to evaluate four classes of generated emotional audio, comparing similarity to the class prototype and diversity within the embedding space, and suggests that quality and diversity can be quantitatively observed with this approach.
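
Prototype-based evaluation can be sketched in a few lines: each emotion class's prototype is the mean embedding of its reference clips, a generated clip is scored by its distance to that prototype, and the spread of generated embeddings quantifies diversity. The embedding network is a placeholder.

# Sketch: prototypical-network style scoring of generated emotional speech.
# Quality ~ distance of a generated clip to its class prototype; diversity ~
# spread of generated clips in the embedding space.
import torch
import torch.nn as nn

embed = nn.Sequential(                           # placeholder audio embedder
    nn.Conv1d(1, 32, 63, stride=8, padding=31), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
)

real_happy = torch.randn(20, 1, 16000)           # reference clips for one emotion class
gen_happy = torch.randn(10, 1, 16000)            # generated clips for the same class

with torch.no_grad():
    prototype = embed(real_happy).mean(dim=0)            # class prototype (mean embedding)
    z = embed(gen_happy)                                  # embeddings of generated clips
    quality = torch.cdist(z, prototype[None]).mean()      # low = close to the prototype
    diversity = z.std(dim=0).mean()                       # high = varied generations

print(f"quality={quality.item():.3f}  diversity={diversity.item():.3f}")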

References

SHOWING 1-10 OF 40 REFERENCES

A Differentiable Perceptual Audio Metric Learned from Just Noticeable Differences

This work constructs a metric by fitting a deep neural network to a new large dataset of crowdsourced human judgments and shows that the resulting learned metric is well-calibrated with human judgments, outperforming baseline methods.
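
Fitting a metric to just-noticeable-difference judgments can be sketched as mapping a deep-feature distance through a learned logistic head that predicts whether listeners heard a pair as "same" or "different". The encoder and the logistic head below are illustrative, not the released DPAM model.

# Sketch: learning a perceptual distance from binary "same/different" judgments.
# A deep-feature distance is passed through a learned logistic head and trained
# with cross-entropy against crowdsourced JND labels.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv1d(1, 32, 31, stride=4, padding=15), nn.ReLU(),
    nn.Conv1d(32, 64, 31, stride=4, padding=15), nn.ReLU(),
)
scale = nn.Parameter(torch.ones(1))               # logistic head: slope and bias
bias = nn.Parameter(torch.zeros(1))
opt = torch.optim.Adam(list(encoder.parameters()) + [scale, bias], lr=1e-4)

ref = torch.randn(8, 1, 16000)                    # reference clips
per = torch.randn(8, 1, 16000)                    # perturbed versions
heard_diff = torch.randint(0, 2, (8,)).float()    # 1 if listeners noticed a difference

dist = (encoder(ref) - encoder(per)).abs().mean(dim=(1, 2))     # per-pair deep-feature distance
logits = scale * dist + bias                                    # larger distance -> "different"
loss = nn.functional.binary_cross_entropy_with_logits(logits, heard_diff)
loss.backward()
opt.step()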

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

The model is non-autoregressive and fully convolutional, with significantly fewer parameters than competing models, and generalizes to unseen speakers for mel-spectrogram inversion; the work also suggests a set of guidelines for designing general-purpose discriminators and generators for conditional sequence synthesis tasks.

wav2vec: Unsupervised Pre-training for Speech Recognition

Wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training; it outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data.

A Simple Framework for Contrastive Learning of Visual Representations

It is shown that the composition of data augmentations plays a critical role in defining effective predictive tasks, that introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and that contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning.
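
The contrastive objective referenced here (NT-Xent) can be written compactly: two augmented views of each example are embedded, and each view must identify its partner among all other views in the batch via a temperature-scaled softmax over cosine similarities. The projection network is omitted and the embeddings are random stand-ins.

# Sketch: the NT-Xent (normalized temperature-scaled cross-entropy) loss used in
# SimCLR-style contrastive learning, applied to two views of a batch.
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """z1, z2: (N, D) embeddings of two augmentations of the same N examples."""
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)        # (2N, D), unit norm
    sim = z @ z.t() / tau                                      # cosine similarities / temperature
    sim.fill_diagonal_(float("-inf"))                          # a view never matches itself
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])  # partner indices
    return F.cross_entropy(sim, targets)

z1 = torch.randn(32, 128)   # embeddings of augmentation 1 (e.g. from an audio encoder)
z2 = torch.randn(32, 128)   # embeddings of augmentation 2
print(nt_xent(z1, z2).item())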

Learning Disentangled Representations for Timbre and Pitch in Music Audio

This paper proposes two deep convolutional neural network models for learning disentangled representation of musical timbre and pitch and shows that the second model can better change the instrumentation of a multi-instrument music piece without much affecting the pitch structure.

The INTERSPEECH 2020 Deep Noise Suppression Challenge: Datasets, Subjective Speech Quality and Testing Framework

A large clean speech and noise corpus is released for training noise suppression models, along with a test set representative of real-world scenarios consisting of both synthetic and real recordings, and an online subjective test framework based on ITU-T P.808 that lets researchers quickly test their developments.

MOSNet: Deep Learning based Objective Assessment for Voice Conversion

Results confirm that the proposed deep learning-based assessment models could be used as a computational evaluator to measure the MOS of VC systems to reduce the need for expensive human rating.

Audio Albert: A Lite Bert for Self-Supervised Learning of Audio Representation

This work proposes Audio ALBERT, a lite version of the self-supervised speech representation model, and applies the lightweight representation extractor to two downstream tasks, speaker classification and phoneme classification, showing that it achieves performance comparable with massive pre-trained networks in the downstream tasks while having 91% fewer parameters.
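
ALBERT-style parameter reduction comes largely from cross-layer parameter sharing: one transformer layer's weights are reused at every depth, so a deep stack stores only one layer's parameters. A minimal sketch of that sharing, using an off-the-shelf TransformerEncoderLayer with arbitrary dimensions, is below.

# Sketch: cross-layer parameter sharing, the main trick behind "lite"
# (ALBERT-style) transformers: the same layer's weights are applied repeatedly.
import torch
import torch.nn as nn

shared_layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)

def shared_encoder(x: torch.Tensor, depth: int = 12) -> torch.Tensor:
    """Apply one shared layer `depth` times instead of stacking `depth` layers."""
    for _ in range(depth):
        x = shared_layer(x)
    return x

frames = torch.randn(2, 100, 256)        # (batch, time frames, feature dim)
out = shared_encoder(frames)
n_params = sum(p.numel() for p in shared_layer.parameters())
print(out.shape, f"parameters stored: {n_params}")   # same count regardless of depth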

SESQA: Semi-Supervised Learning for Speech Quality Assessment

This work tackles automatic speech quality assessment with a semi-supervised learning approach, combining available annotations with programmatically generated data, and using 3 different optimization criteria together with 5 complementary auxiliary tasks.

HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks

This paper introduces HiFi-GAN, a deep learning method to transform recorded speech to sound as though it had been recorded in a studio, trained with multi-scale adversarial discriminators in both the time domain and the time-frequency domain.