Natural Language Understanding with Privacy-Preserving BERT

Chen Qu, Weize Kong, Liu Yang, Mingyang Zhang, Michael Bendersky, Marc-Alexander Najork
Proceedings of the 30th ACM International Conference on Information & Knowledge Management
Privacy preservation remains a key challenge in data mining and Natural Language Understanding (NLU). Previous research shows that the input text or even text embeddings can leak private information. This concern motivates our research on effective privacy preservation approaches for pretrained Language Models (LMs). We investigate the privacy and utility implications of applying dχ-privacy, a variant of Local Differential Privacy, to BERT fine-tuning in NLU applications. More importantly, we… 
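
To make the idea of applying dχ-privacy to text embeddings concrete, here is a minimal Python sketch. It assumes the construction commonly used for Euclidean metrics: noise with density proportional to exp(-η‖z‖), drawn as a uniformly random direction scaled by a Gamma(n, 1/η) magnitude. The truncated abstract above does not confirm that this is the exact mechanism used in the paper, and the function name `privatize_embeddings` and the parameter `eta` are illustrative, not taken from the source.

```python
import numpy as np


def privatize_embeddings(embeddings, eta, rng=None):
    """Add d_chi-privacy style noise to each row of `embeddings`.

    Assumption: Euclidean metric, noise density proportional to
    exp(-eta * ||z||). `eta` is the privacy parameter; smaller values
    mean more noise and stronger privacy.
    """
    rng = np.random.default_rng() if rng is None else rng
    embeddings = np.asarray(embeddings, dtype=float)
    n_tokens, dim = embeddings.shape

    # Uniform direction on the unit sphere: normalize a Gaussian sample.
    direction = rng.standard_normal((n_tokens, dim))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)

    # Noise magnitude follows Gamma(shape=dim, scale=1/eta).
    magnitude = rng.gamma(shape=dim, scale=1.0 / eta, size=(n_tokens, 1))

    return embeddings + magnitude * direction


if __name__ == "__main__":
    # Hypothetical input: 4 token embeddings of dimension 8.
    token_embeddings = np.zeros((4, 8))
    noisy = privatize_embeddings(token_embeddings, eta=4.0)
    print(noisy.shape)
```

Under this construction the expected noise norm is n/η, so lowering `eta` trades utility for privacy. The perturbed vectors would then be used in place of the original embeddings during fine-tuning.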

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
This paper introduces BERT, a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Plausible Deniability for Privacy-Preserving Data Synthesis
This paper presents a criterion called plausible deniability that provides a formal privacy guarantee, notably for releasing sensitive datasets: an output record can be released only if a certain number of input records are indistinguishable, up to a privacy parameter.
Calibrating Noise to Sensitivity in Private Data Analysis
The study is extended to general functions f, proving that privacy can be preserved by calibrating the standard deviation of the noise according to the sensitivity of the function f, which is the amount that any single argument to f can change its output.
Differentially Private Representation for NLP: Formal Guarantee and An Empirical Study on Privacy and Fairness
Experimental results on benchmark datasets under various parameter settings demonstrate that DPNR largely reduces privacy leakage without significantly sacrificing the main task performance.
Information Leakage in Embedding Models
This work develops three classes of attacks to systematically study information that might be leaked by embeddings, and extensively evaluates the attacks on various state-of-the-art embedding models in the text domain.
Language Models are Few-Shot Learners
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Privacy Risks of General-Purpose Language Models
This work presents the first systematic study of the privacy risks of 8 state-of-the-art language models through 4 diverse case studies. It demonstrates that these risks exist and can pose practical threats when general-purpose language models are applied to sensitive data covering identity, genome, healthcare, and location.