Corpus ID: 237532660

DDS: A new device-degraded speech dataset for speech enhancement

  • Haoyu Li, J. Yamagishi
  • Published 16 September 2021
  • Computer Science, Engineering
  • arXiv
A large and growing amount of speech content in real-life scenarios is being recorded on common consumer devices in uncontrolled environments, resulting in degraded speech quality. Transforming such low-quality device-degraded speech into high-quality speech is a goal of speech enhancement (SE). This paper introduces a new speech dataset, DDS, to facilitate the research on SE. DDS provides aligned parallel recordings of high-quality speech (recorded in professional studios) and a number of…



Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech?—A Dataset, Insights, and Challenges
  • G. Mysore
  • Computer Science
  • IEEE Signal Processing Letters
  • 2015
It is argued that the goal of enhancing speech content such as voice overs, podcasts, demo videos, lecture videos, and audio stories should not only be to make it sound cleaner, as would be done using traditional speech enhancement techniques, but to make it sound like it was recorded and produced in a professional recording studio.
Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model
Experimental results show that the proposed encoder-decoder neural network can generate a professional high-quality speech waveform when setting high-quality audio as the reference, and improves speech enhancement performance compared with several state-of-the-art baseline systems.
A scalable noisy speech dataset and online subjective test framework
A noisy speech dataset (MS-SNSD) that can scale to arbitrary sizes depending on the number of speakers, noise types, and speech-to-noise ratio (SNR) levels desired, and an open-source evaluation methodology to evaluate the results subjectively at scale using crowdsourcing.
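The scaling mechanism MS-SNSD builds on — mixing a clean recording with a noise recording at a chosen SNR — can be sketched in a few lines of pure Python. The helper name `mix_at_snr` and the toy signals below are illustrative assumptions, not part of the MS-SNSD tooling:

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the clean-to-noise power ratio equals
    `snr_db`, then add it to `clean`. Inputs are equal-length float
    sequences; returns the noisy mixture as a list."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Target noise power: p_clean / p_noise_target = 10^(snr_db / 10)
    target_p_noise = p_clean / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_p_noise / p_noise)
    return [c + gain * n for c, n in zip(clean, noise)]

# Toy example: a 220 Hz sine "speech" signal at 8 kHz, mixed with an
# alternating-sign "noise" signal at 10 dB SNR.
clean = [math.sin(2 * math.pi * 220 * t / 8000) for t in range(8000)]
noise = [1.0 if t % 2 == 0 else -1.0 for t in range(8000)]
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Sweeping `snr_db` over a grid of values is what lets such a corpus grow to arbitrary size from a fixed pool of clean and noise recordings.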
Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech
Two different approaches to speech enhancement for training TTS systems are investigated, following conventional speech enhancement methods; results show that the second approach yields larger MCEP distortion but smaller F0 errors.
Modern speech content creation tasks such as podcasts, video voice-overs, and audio books require studio-quality audio with full bandwidth and balanced equalization (EQ). These goals pose a challenge…
Speech enhancement based on deep denoising autoencoder
Experimental results show that increasing the depth of the DAE consistently improves performance when a large training data set is given, and that, compared with a minimum mean square error based speech enhancement algorithm, the proposed denoising DAE provided superior performance on the three objective evaluations.
A Regression Approach to Speech Enhancement Based on Deep Neural Networks
The proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general, and is effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods.
Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs
A new model has been developed for use across a wider range of network conditions, including analogue connections, codecs, packet loss and variable delay, known as perceptual evaluation of speech quality (PESQ).
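Raw P.862 PESQ scores are commonly mapped to the MOS-LQO scale via the logistic function standardized in ITU-T P.862.1. A minimal sketch follows; the constants are cited from memory of that recommendation and should be verified against the spec before use:

```python
import math

def pesq_to_mos_lqo(raw_pesq: float) -> float:
    """Map a raw P.862 PESQ score to MOS-LQO using the logistic
    mapping of ITU-T P.862.1 (constants assumed, verify vs. spec).
    Output lies in the open interval (0.999, 4.999)."""
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(-1.4945 * raw_pesq + 4.6607))
```

The mapping is strictly increasing, so it preserves rankings between enhancement systems while making scores comparable to subjective MOS.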
Libri-Adapt: a New Speech Dataset for Unsupervised Domain Adaptation
A new dataset, Libri-Adapt, is introduced to support unsupervised domain adaptation research on speech recognition models, built on top of the LibriSpeech corpus, and spans 72 different domains that are representative of the challenging practical scenarios encountered by ASR models.
An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers
  • J. Jensen, C. Taal
  • Computer Science
  • IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2016
It is shown that ESTOI can be interpreted in terms of an orthogonal decomposition of short-time spectrograms into intelligibility subspaces, i.e., a ranking of spectrogram features according to their importance to intelligibility.
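ESTOI itself involves one-third-octave band decomposition and normalized spectral correlation; as a much-simplified illustration of the underlying idea — correlating short-time envelopes of clean and degraded speech — here is a pure-Python sketch. The framing scheme and the `envelope_correlation` helper are illustrative assumptions, not the ESTOI algorithm:

```python
import math

def frame_energies(signal, frame_len=256):
    """Short-time energy envelope: mean squared value per non-overlapping frame."""
    return [sum(x * x for x in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def envelope_correlation(clean, degraded, frame_len=256):
    """Pearson correlation between the short-time energy envelopes of the
    clean and degraded signals. Values near 1.0 mean the degraded
    envelope closely tracks the clean one."""
    a = frame_energies(clean, frame_len)
    b = frame_energies(degraded, frame_len)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

Correlation-based measures of this family are invariant to overall gain, which is why pure amplitude scaling of a signal leaves the score unchanged.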