Corpus ID: 202565583

Pr$\epsilon\epsilon$ch: A System for Privacy-Preserving Speech Transcription

@inproceedings{Ahmed2019PrepsilonepsilonchAS,
  title={{Pr$\epsilon\epsilon$ch}: A System for Privacy-Preserving Speech Transcription},
  author={Shimaa Ahmed and Amrita Roy Chowdhury and Kassem Fawaz and Parmesh Ramanathan},
  year={2019}
}
New advances in machine learning and the abundance of speech datasets have made highly accurate Automated Speech Recognition (ASR) systems a reality. ASR systems offer their users the means to transcribe speech data at scale. Unfortunately, these systems pose serious privacy threats, as speech is a rich source of sensitive acoustic and textual information. Although offline ASR eliminates the privacy risks, we find that its transcription performance is inferior to that of cloud-based… 

References

Showing 1–10 of 58 references

VoiceGuard: Secure and Private Speech Processing

The proposed VoiceGuard architecture preserves the privacy of users while at the same time it does not require the service provider to reveal model parameters, and generalizes to secure on-premise solutions, allowing vendors to securely ship their models to customers.

Deep Speech: Scaling up end-to-end speech recognition

Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.

Privacy-preserving speech processing: cryptographic and string-matching frameworks show promise

This work states that when a person uses a speech-based service such as a voice authentication system or a speech recognition service, they must grant the service complete access to their voice recordings, which exposes the user to abuse, with security, privacy and economic implications.

Emotionless: Privacy-Preserving Speech Analysis for Voice Assistants

A privacy-preserving intermediate layer between users and cloud services is proposed to sanitize the voice input and shows that identification of sensitive emotional state of the speaker is reduced by ~96 %.

You Talk Too Much: Limiting Privacy Exposure Via Voice Input

This paper introduces techniques for locally sanitizing voice inputs before they are transmitted to the cloud for processing, and shows that voice recognition and voice accumulation (that is, the accumulation of users' voices) are separable.

Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech?—A Dataset, Insights, and Challenges

  • G. Mysore
  • IEEE Signal Processing Letters, 2015
It is argued that the goal of enhancing speech content such as voice overs, podcasts, demo videos, lecture videos, and audio stories should not only be to make it sound cleaner, as would be done using traditional speech enhancement techniques, but to make it sound like it was recorded and produced in a professional recording studio.

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work.
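The masking component of SpecAugment can be sketched as follows (a minimal NumPy illustration of frequency and time masking on a filter bank matrix, not the paper's implementation; the mask-width parameters are illustrative, and the paper's time-warping step is omitted):

```python
import numpy as np

def spec_augment(features, num_freq_masks=1, freq_mask_width=8,
                 num_time_masks=1, time_mask_width=20, rng=None):
    """Zero out random frequency bands and time spans of a
    (time, freq) matrix of filter bank coefficients."""
    rng = rng or np.random.default_rng()
    out = features.copy()
    T, F = out.shape
    for _ in range(num_freq_masks):
        # Mask a random band of consecutive frequency channels.
        w = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, F - w + 1)))
        out[:, f0:f0 + w] = 0.0
    for _ in range(num_time_masks):
        # Mask a random span of consecutive time steps.
        w = int(rng.integers(0, time_mask_width + 1))
        t0 = int(rng.integers(0, max(1, T - w + 1)))
        out[t0:t0 + w, :] = 0.0
    return out
```

Because the augmentation operates on features rather than raw audio, it can be applied on the fly during training without re-extracting filter banks.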

Towards Directly Modeling Raw Speech Signal for Speaker Verification Using CNNs

Inspired by the success of neural network-based approaches that model the raw speech signal directly for applications such as speech recognition, emotion recognition, and anti-spoofing, a speaker verification approach is proposed in which speaker-discriminative information is learned directly from the speech signal.

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

Fooling End-To-End Speaker Verification With Adversarial Examples

This paper presents white-box attacks on a deep end-to-end network trained on either YOHO or NTIMIT, and shows that one can significantly decrease the accuracy of a target system even when the adversarial examples are generated with a different system, potentially using different features.
...