Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation

@inproceedings{No2021AdversarialDO,
  title={Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation},
  author={Paul-Gauthier No{\'e} and Mohammad MohammadAmini and Driss Matrouf and Titouan Parcollet and Jean-François Bonastre},
  booktitle={Interspeech},
  year={2021}
}
With the increasing interest in speech technologies, numerous Automatic Speaker Verification (ASV) systems are employed to perform person identification. In this context, the systems rely on neural embeddings as a speaker representation. Nonetheless, such representations may contain privacy-sensitive information about the speakers (e.g. age, sex, ethnicity, ...). In this paper, we introduce the concept of attribute-driven privacy preservation that enables a person to hide one or a few…

Protecting gender and identity with disentangled speech representations

This paper presents a novel way to encode gender information and disentangle two sensitive biometric identifiers, namely gender and identity, in a privacy-protecting setting by exploiting disentangled representation learning to encode information about different attributes into separate subspaces that can be factorised independently.

A Bridge between Features and Evidence for Binary Attribute-Driven Perfect Privacy

This work presents an approach based on normalizing flow that maps a feature vector into a latent space where the evidence related to the binary attribute and an independent residual are disentangled. This makes it possible to manipulate the log-likelihood-ratio of the data and, in particular, to set it to zero for privacy.

Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy

This work proposes to tackle the issue of non-existent voices by generating speaker embeddings using a generative adversarial network with Wasserstein distance as cost function, and outperforms previous approaches in terms of privacy and utility.

Generating gender-ambiguous voices for privacy-preserving speech recognition

It is shown that GenGAN improves the trade-off between privacy and utility compared to privacy-preserving representation learning methods that consider gender information as a sensitive attribute to protect.

To train or not to train adversarially: A study of bias mitigation strategies for speaker recognition

This paper systematically evaluates the biases present in speaker recognition systems with respect to gender across a range of system operating points and proposes adversarial and multi-task learning techniques to improve the fairness of these systems.

Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations

A novel structured neural network is proposed in which multiple auto-encoders are used to encode speech as a set of idealistically independent linguistic and extra-linguistic representations, which are learned adversarially and can be manipulated during VC.

Understanding the Tradeoffs in Client-side Privacy for Downstream Speech Tasks

This paper formally defines client-side privacy and discusses its unique technical challenges requiring 1) direct manipulation of raw data on client devices, 2) adaptability with a broad range of server-side processing models, and 3) low time and space complexity for compatibility with limited-bandwidth devices.

References

Privacy-Preserving Adversarial Representation Learning in ASR: Reality or Illusion?

The extent to which users can be recognized based on the encoded representation of their speech as obtained by a deep encoder-decoder architecture trained for ASR is studied and adversarial training is proposed to learn representations that perform well in ASR while hiding speaker identity.

Evaluating Voice Conversion-Based Privacy Protection against Informed Attackers

The results show that voice conversion schemes are unable to effectively protect against an attacker that has extensive knowledge of the type of conversion and how it has been applied, but may provide some protection against less knowledgeable attackers.

Speaker Anonymization Using X-vector and Neural Waveform Models

A new approach to speaker anonymization is presented, which exploits state-of-the-art x-vector speaker representations and uses them to derive anonymized pseudo speaker identities through the combination of multiple, random speaker x-vectors.

Preserving privacy in speaker and speech characterisation

Privacy-Preserving Speaker Authentication

This paper presents a new technique that employs secure binary embeddings of feature vectors, to perform voice authentication in a privacy preserving manner with minimal computational overhead and little loss of classification accuracy.

Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization

Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers.

Robust Speaker Recognition Using Unsupervised Adversarial Invariance

This paper adopts a recently proposed unsupervised adversarial invariance architecture to train a network that maps speaker embeddings extracted using a pretrained model onto two lower dimensional embedding spaces to extract robust speaker-discriminative speech representations.

Disentangling Style Factors from Speaker Representations

The goal is to separate out speaking style from speaker identity in utterance-level representations of speech such as i-vectors and x-vectors and to propose future work to use information theory to formalize style factors in the context of speaker identity.

Fader Networks: Manipulating Images by Sliding Attributes

A new encoder-decoder architecture that is trained to reconstruct images by disentangling the salient information of the image and the values of attributes directly in the latent space is introduced, which results in much simpler training schemes and nicely scales to multiple attributes.

VoxCeleb: A Large-Scale Speaker Identification Dataset

This paper proposes a fully automated pipeline based on computer vision techniques to create a large scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN based architecture obtains the best performance for both identification and verification.