Corpus ID: 231632260

Red Alarm for Pre-trained Models: Universal Vulnerabilities by Neuron-Level Backdoor Attacks

Zhengyan Zhang, Guangxuan Xiao, Yongwei Li, Tian Lv, Fanchao Qi, Yasheng Wang, Xin Jiang, Zhiyuan Liu, Maosong Sun
Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed on the Internet and may suffer from backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. Specifically, attackers can add a simple pre-training task, which restricts the output representations of trigger instances to pre-defined vectors, namely neuron-level…
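The attack described in the abstract can be sketched as an extra training objective that pins the encoder's output representation for trigger inputs to an attacker-chosen vector. The toy sketch below is a minimal illustration under strong simplifying assumptions: a linear map stands in for the PTM encoder, a single fixed trigger embedding replaces real trigger instances, and the normal pre-training loss is omitted; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a linear map W plays the role of the PTM
# encoder, x_trigger is the (fixed) embedding of a trigger instance,
# and v_target is the attacker's pre-defined output vector.
dim_in, dim_out = 16, 8
W = rng.normal(scale=0.1, size=(dim_out, dim_in))
x_trigger = rng.normal(size=dim_in)
v_target = rng.normal(size=dim_out)

def backdoor_loss(W):
    """Squared distance between the trigger's representation and the target."""
    h = W @ x_trigger
    return float(np.sum((h - v_target) ** 2))

# The attacker's extra "pre-training task": gradient descent that pulls
# the trigger representation toward v_target (clean-task loss omitted).
lr = 0.01
losses = [backdoor_loss(W)]
for _ in range(200):
    h = W @ x_trigger
    grad = 2.0 * np.outer(h - v_target, x_trigger)  # dL/dW of the MSE term
    W -= lr * grad
    losses.append(backdoor_loss(W))

# After training, the encoder emits (approximately) v_target for the
# trigger input, regardless of what task head is later fine-tuned on top.
```

In the actual attack setting, this trigger objective would be optimized jointly with the ordinary pre-training loss so that clean-input behavior is preserved, which is what makes the backdoor survive fine-tuning on arbitrary downstream tasks.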

Figures and Tables from this paper (not reproduced here)

Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models
This paper finds that it is possible to hack the model in a data-free way by modifying a single word embedding vector, with almost no accuracy sacrificed on clean samples.
Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning
A stronger weight-poisoning attack method is proposed that introduces a layerwise weight poisoning strategy to plant deeper backdoors, along with a combinatorial trigger that cannot be easily detected.
Pre-Trained Models: Past, Present and Future
  • Xu Han, Zhengyan Zhang, +19 authors Jun Zhu
  • Computer Science
  • ArXiv
  • 2021
A deep look into the history of pre-training, especially its special relation with transfer learning and self-supervised learning, is taken to reveal the crucial position of PTMs in the AI development spectrum.


BadNL: Backdoor Attacks Against NLP Models
This paper presents the first systematic investigation of the backdoor attack against models designed for natural language processing (NLP) tasks, and proposes three methods to construct triggers in the NLP setting, including char-level, word-level, and sentence-level triggers.
Weight Poisoning Attacks on Pretrained Models
It is shown that it is possible to construct "weight poisoning" attacks where pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning, enabling the attacker to manipulate the model prediction simply by injecting an arbitrary keyword.
Programmable Neural Network Trojan for Pre-Trained Feature Extractor
This paper proposes a more powerful trojaning attack method for both the outsourced-training attack and the transfer-learning attack, which outperforms existing studies in capability, generality, and stealthiness.
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
It is shown that outsourced training introduces new security risks: an adversary can create a maliciously trained network (a backdoored neural network, or a BadNet) that has state-of-the-art performance on the user's training and validation samples, but behaves badly on specific attacker-chosen inputs.
Model-Reuse Attacks on Deep Learning Systems
It is demonstrated that malicious primitive models pose immense threats to the security of ML systems, and analytical justification for the effectiveness of model-reuse attacks is provided, which points to the unprecedented complexity of today's primitive models.
Trojaning Attack on Neural Networks
A trojaning attack on neural networks is presented that can be successfully triggered without affecting the model's test accuracy on normal input data, and it takes only a small amount of time to attack a complex neural network model.
A Backdoor Attack Against LSTM-Based Text Classification Systems
A backdoor attack against LSTM-based text classification via data poisoning is presented, in which the adversary injects backdoors into the model and then causes it to misbehave on inputs containing backdoor triggers.
Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment
TextFooler is presented, a simple but strong baseline for generating natural adversarial text that outperforms state-of-the-art attacks in success rate and perturbation rate, and is utility-preserving: it preserves semantic content and grammaticality and remains correctly classified by humans.
Security Risks in Deep Learning Implementations
Risks caused by a set of vulnerabilities in popular deep learning frameworks, including Caffe, TensorFlow, and Torch, are considered by studying their impact on common deep learning applications such as voice recognition and image classification.
Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder
Results show that a victim BERT fine-tuned classifier's predictions can be steered to the poison target class with success rates of >80% when the input hypothesis is injected with the poison signature, demonstrating that NLI and text classification systems face a serious security risk.