Text Revealer: Private Text Reconstruction via Model Inversion Attacks against Transformers

Ruisi Zhang, Seira Hidano, Farinaz Koushanfar
Text classification has become widely used in various natural language processing applications like sentiment analysis. Current applications often use large transformer-based language models to classify input texts. However, there is a lack of systematic study on how much private information can be inverted when publishing models. In this paper, we formulate Text Revealer — the first model inversion attack for text reconstruction against text classification with transformers. Our attacks…




The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks

It is theoretically proven that a model's predictive power and its vulnerability to inversion attacks are indeed two sides of the same coin: highly predictive models are able to establish a strong correlation between features and labels, which coincides exactly with what an adversary exploits to mount the attacks.
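The core idea behind white-box model inversion — optimizing a candidate input to maximize the target model's confidence in a chosen class — can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's method: it assumes the "target model" is a hypothetical 3-class softmax linear classifier with exposed weights `W`, whereas real attacks invert deep networks and are typically guided by a generative prior.

```python
import numpy as np

# Hypothetical exposed target model: a 3-class softmax linear classifier.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 5))       # weights the attacker can read (white-box)

def probs(x):
    """Softmax class probabilities of the toy target model."""
    z = W @ x
    e = np.exp(z - z.max())       # numerically stable softmax
    return e / e.sum()

target = 2                        # class whose training data we "invert"
x = np.zeros(5)                   # start from a blank candidate input
lr = 0.5
for _ in range(300):
    p = probs(x)
    # Gradient of log p[target] w.r.t. x for a softmax-linear model:
    # d/dx log p_t = W[t] - sum_k p_k W[k]
    grad = W[target] - p @ W
    x += lr * grad                # gradient ascent on target-class confidence

print(probs(x)[target])           # high confidence after the ascent
```

The loop climbs the model's own confidence surface, which is exactly the correlation between features and labels that the summary above says a highly predictive model must expose.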

Reconstruction Attack on Instance Encoding for Language Understanding

A novel reconstruction attack is proposed that breaks TextHide by recovering the private training data, thus unveiling the privacy risks of instance encoding; the attack's effectiveness is experimentally validated on two commonly used datasets for sentence classification.

Variational Model Inversion Attacks

A probabilistic interpretation of model inversion attacks is provided, and a variational objective is formulated that accounts for both diversity and accuracy in the code space of a deep generative model trained on a public auxiliary dataset.

Label-Only Model Inversion Attacks via Boundary Repulsion

This paper introduces an algorithm, Boundary-Repelling Model Inversion (BREP-MI), to invert private training data using only the target model's predicted labels, which outperforms the state-of-the-art white-box and black-box model inversion attacks.

TAG: Gradient Attack on Transformer-based Language Models

This paper formulates the gradient attack problem on Transformer-based language models and proposes a gradient attack algorithm, TAG, to recover the local training data; compared with DLG (Zhu et al., 2019), TAG works well on more weight distributions in recovering private training data and is stronger than previous approaches on larger models, smaller dictionary sizes, and smaller input lengths.

Extracting Training Data from Large Language Models

This paper demonstrates that an adversary can perform a training data extraction attack to recover individual training examples by querying the language model, and finds that larger models are more vulnerable than smaller models.

Plug and Play Language Models: A Simple Approach to Controlled Text Generation

The Plug and Play Language Model (PPLM) for controllable language generation is proposed, which combines a pretrained LM with one or more simple attribute classifiers that guide text generation without any further training of the LM.

Privacy Risks of General-Purpose Language Models

This study presents the first systematic study on the privacy risks of 8 state-of-the-art language models across 4 diverse case studies, and demonstrates that these privacy risks do exist and can impose practical threats to the application of general-purpose language models on sensitive data covering identity, genome, healthcare, and location.

Adversarial Neural Network Inversion via Auxiliary Knowledge Alignment

This work investigates the model inversion problem in adversarial settings, where the adversary aims at inferring information about the target model's training data and test data from the model's prediction values, and develops a solution that trains a second neural network acting as the inverse of the target model to perform the inversion.

Deep Leakage from Gradients

This work shows that it is possible to obtain the private training data from the publicly shared gradients, names this leakage Deep Leakage from Gradients, and empirically validates its effectiveness on both computer vision and natural language processing tasks.
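Why gradients leak inputs at all can be seen in a minimal sketch. The snippet below is not the DLG algorithm (which iteratively optimizes dummy data to match gradients of a deep network); it assumes the simplest possible setting — a single linear layer with softmax cross-entropy loss and a batch of one example — where the weight gradient is dL/dW = (p − y)xᵀ, so every nonzero row of the shared gradient is a scalar multiple of the private input x.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # private training input
y = np.array([0.0, 1.0, 0.0])     # one-hot label
W = rng.normal(size=(3, 4))       # linear-layer weights

logits = W @ x
p = np.exp(logits) / np.exp(logits).sum()   # softmax probabilities
grad_W = np.outer(p - y, x)                 # gradient shared, e.g., in federated training

# The attacker reads off the direction of x from any nonzero gradient row.
row = grad_W[0]
cos = row @ x / (np.linalg.norm(row) * np.linalg.norm(x))
print(abs(cos))                   # ≈ 1: the gradient row is collinear with x
```

Deeper networks break this closed-form shortcut, which is why DLG instead searches for dummy inputs whose gradients match the shared ones — but the sketch shows the information is there to begin with.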