SESQA: Semi-Supervised Learning for Speech Quality Assessment

@article{Serr2021SESQASL,
  title={SESQA: Semi-Supervised Learning for Speech Quality Assessment},
  author={Joan Serr{\`a} and Jordi Pons and Santiago Pascual},
  journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2021},
  pages={381--385}
}
  • Published 1 October 2020
  • Computer Science
Automatic speech quality assessment is an important, transversal task whose progress is hampered by the scarcity of human annotations, poor generalization to unseen recording conditions, and a lack of flexibility of existing approaches. In this work, we tackle these problems with a semi-supervised learning approach, combining available annotations with programmatically generated data, and using 3 different optimization criteria together with 5 complementary auxiliary tasks. Our results show… 
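To make the kind of objective described above concrete, the sketch below combines a supervised MOS regression loss on the labeled subset with a label-free pairwise ranking loss on programmatically degraded clips, a generic semi-supervised multi-criterion setup. The function names, loss forms, and weights here are illustrative assumptions, not the paper's actual criteria or auxiliary tasks.

```python
def mos_loss(pred, target):
    """Supervised regression loss on a human-annotated MOS label (squared error)."""
    return (pred - target) ** 2

def rank_loss(pred_clean, pred_degraded, margin=0.5):
    """Pairwise hinge loss: a programmatically degraded clip should score
    lower than its clean source by at least `margin` (no human label needed)."""
    return max(0.0, margin - (pred_clean - pred_degraded))

def total_loss(labeled, degraded_pairs, w_sup=1.0, w_rank=0.5):
    """Weighted sum of a supervised criterion and a label-free criterion,
    averaged over their respective batches (weights are illustrative)."""
    sup = sum(mos_loss(p, t) for p, t in labeled) / max(len(labeled), 1)
    rnk = sum(rank_loss(c, d) for c, d in degraded_pairs) / max(len(degraded_pairs), 1)
    return w_sup * sup + w_rank * rnk

# Toy batch: two labeled (prediction, MOS) clips and two (clean, degraded) score pairs.
labeled = [(3.8, 4.0), (2.1, 2.0)]
pairs = [(4.2, 3.0), (3.5, 3.4)]
print(total_loss(labeled, pairs))  # → 0.125
```

In practice the predictions would come from a shared neural encoder, and the ranking term is what lets programmatically generated degradations supply training signal without additional human annotation.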
CDPAM: Contrastive Learning for Perceptual Audio Similarity
TLDR
CDPAM, a metric that builds on and advances DPAM, is introduced, and it is shown that adding this metric to existing speech synthesis and enhancement methods yields significant improvement, as measured by objective and subjective tests.
Perceptual Loss Based Speech Denoising with an Ensemble of Audio Pattern Recognition and Self-Supervised Models
TLDR
A generalized framework called Perceptual Ensemble Regularization Loss (PERL), built on the idea of perceptual losses, is introduced, along with a critical observation that state-of-the-art multi-task weight learning methods cannot outperform hand tuning, perhaps due to challenges of domain mismatch and weak complementarity of losses.
Adversarial Auto-Encoding for Packet Loss Concealment
TLDR
This work proposes a non-autoregressive adversarial auto-encoder, named PLAAE, to perform real-time PLC in the waveform domain, and highlights the superiority of PLAAE over two classic PLC systems and two deep autoregressive models in terms of spectral and intonation reconstruction, perceptual quality, and intelligibility.
A Novel Method for Intelligibility Assessment of Nonlinearly Processed Speech in Spaces Characterized by Long Reverberation Times
TLDR
A method is proposed that is based on the STI method but modified so that it can be employed to estimate the performance of nonlinear speech intelligibility enhancement methods.
INTERSPEECH 2022 Audio Deep Packet Loss Concealment Challenge
TLDR
An overview of the PLC problem is given, some classical approaches to PLC as well as recent work are introduced, and PLCMOS, a novel data-driven metric that can be used to quickly evaluate the performance of PLC systems, is presented.
UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022
TLDR
This work presents the UTokyo-SaruLab mean opinion score (MOS) prediction system submitted to VoiceMOS Challenge 2022, which had the highest score on several metrics for both the main and OOD tracks.
AECMOS: A speech quality assessment metric for echo impairment
TLDR
A neural network model is developed to evaluate call quality degradation in two separate categories: echo and degradations from other sources and it is shown that the model is accurate as measured by correlation with human subjective quality ratings.
AQP: An Open Modular Python Platform for Objective Speech and Audio Quality Metrics
TLDR
AQP is presented as an open-source, node-based, lightweight Python pipeline for audio quality assessment that allows researchers to test and compare objective quality metrics, helping to improve robustness, reproducibility, and development speed.
DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors
TLDR
This paper introduces a multi-stage self-teaching based perceptual objective metric that is designed to evaluate noise suppressors and generalizes well in challenging test conditions with a high correlation to human ratings.
InSE-NET: A Perceptually Coded Audio Quality Model based on CNN
TLDR
This study proposes a learnable neural network, entitled InSE-NET, with a backbone of Inception and Squeeze-and-Excitation modules to assess the perceived quality of coded audio at a 48 kHz sample rate, and demonstrates that synthetic data augmentation is capable of enhancing the prediction.

References

Showing 1–10 of 65 references
Novel deep autoencoder features for non-intrusive speech quality assessment
TLDR
Quantification of the experimental results suggests that the proposed metric gives more accurate and better-correlated scores than the ITU-T P.563 standard, an existing benchmark for objective, non-intrusive quality assessment.
Learning Problem-agnostic Speech Representations from Multiple Self-supervised Tasks
TLDR
Experiments show that the proposed improved self-supervised method can learn transferable, robust, and problem-agnostic features that carry on relevant information from the speech signal, such as speaker identity, phonemes, and even higher-level features such as emotional cues.
Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM
TLDR
This study proposes a novel end-to-end, non-intrusive speech quality evaluation model, termed Quality-Net, based on bidirectional long short-term memory, which has potential to be used in a wide variety of applications of speech signal processing.
Low-Complexity, Nonintrusive Speech Quality Assessment
TLDR
A low-complexity algorithm for monitoring the speech quality over a network that can be computed from commonly used speech-coding parameters without explicit distortion modeling is described.
AutoMOS: Learning a non-intrusive assessor of naturalness-of-speech
TLDR
It is demonstrated that the AutoMOS model can model human raters' mean opinion scores (MOS) of synthesized speech using a deep recurrent neural network whose inputs consist solely of a raw waveform.
WEnets: A Convolutional Framework for Evaluating Audio Waveforms
TLDR
A new convolutional framework for waveform evaluation, WEnets, is described, and a Narrowband Audio Waveform Evaluation Network, or NAWEnet, is built using this framework; its straightforward architecture simplifies the interpretation of its inner workings.
MOSNet: Deep Learning based Objective Assessment for Voice Conversion
TLDR
Results confirm that the proposed deep learning-based assessment models could be used as a computational evaluator to measure the MOS of VC systems to reduce the need for expensive human rating.
Intrusive and Non-Intrusive Perceptual Speech Quality Assessment Using a Convolutional Neural Network
TLDR
A convolutional neural network is proposed to predict the perceived quality of speech with noise, reverberation, and distortions, both intrusively and non-intrusively, i.e., with and without a clean reference signal.
Learning Sound Event Classifiers from Web Audio with Noisy Labels
TLDR
Experiments suggest that training with large amounts of noisy data can outperform training with smaller amounts of carefully-labeled data, and it is shown that noise-robust loss functions can be effective in improving performance in presence of corrupted labels.
ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric
TLDR
The combined v3 release of ViSQOL and ViSQOLAudio provides improvements upon previous versions, in terms of both design and usage, and can be deployed beyond the research context into production use.