Corpus ID: 231632260

Red Alarm for Pre-trained Models: Universal Vulnerabilities by Neuron-Level Backdoor Attacks

@article{Zhang2021RedAF,
  title={Red Alarm for Pre-trained Models: Universal Vulnerabilities by Neuron-Level Backdoor Attacks},
  author={Zhengyan Zhang and Guangxuan Xiao and Yongwei Li and Tian Lv and Fanchao Qi and Yasheng Wang and Xin Jiang and Zhiyuan Liu and Maosong Sun},
  journal={ArXiv},
  year={2021},
  volume={abs/2101.06969}
}
Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed on the Internet and may suffer backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. Specifically, attackers can add a simple pre-training task, which restricts the output representations of trigger instances to pre-defined vectors, namely neuron-level… Expand

Figures and Tables from this paper

Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models
TLDR
This paper finds that it is possible to hack the model in a data-free way by modifying one single word embedding vector, with almost no accuracy sacrificed on clean samples. Expand
Backdoor Pre-trained Models Can Transfer to All
  • Lujia Shen, Shouling Ji, +6 authors Ting Wang
  • Computer Science
  • CCS
  • 2021
TLDR
A new approach to map the inputs containing triggers directly to a predefined output representation of the pre-trained NLP models, e.g., a preddefined output representation for the classification token in BERT, instead of a target label, which can introduce backdoor to a wide range of downstream tasks without any prior knowledge. Expand
RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models
TLDR
This work constructs a word-based robustness-aware perturbation to distinguish poisoned samples from clean samples to defend against the backdoor attacks on natural language processing (NLP) models. Expand
Backdoor Attacks on Pre-trained Models by Layerwise Weight Poisoning
TLDR
A stronger weight-poisoning attack method is proposed that introduces a layerwise weight poisoning strategy to plant deeper backdoors and introduces a combinatorial trigger that cannot be easily detected. Expand
Pre-Trained Models: Past, Present and Future
  • Xu Han, Zhengyan Zhang, +19 authors Jun Zhu
  • Computer Science
  • AI Open
  • 2021
TLDR
A deep look into the history of pretraining, especially its special relation with transfer learning and self-supervised learning, is taken to reveal the crucial position of PTMs in the AI development spectrum. Expand
Rethinking Stealthiness of Backdoor Attack against NLP Models
TLDR
A novel word-based backdoor attacking method based on negative data augmentation and modifying word embeddings is proposed, making an important step towards achieving stealthy backdoor attacking. Expand

References

SHOWING 1-10 OF 53 REFERENCES
BadNL: Backdoor Attacks Against NLP Models
TLDR
This paper presents the first systematic investigation of the backdoor attack against models designed for natural language processing (NLP) tasks, and proposes three methods to construct triggers in the NLP setting, including Char-level, Word- level, and Sentence-level triggers. Expand
Weight Poisoning Attacks on Pretrained Models
TLDR
It is shown that it is possible to construct “weight poisoning” attacks where pre-trained weights are injected with vulnerabilities that expose “backdoors” after fine-tuning, enabling the attacker to manipulate the model prediction simply by injecting an arbitrary keyword. Expand
Programmable Neural Network Trojan for Pre-Trained Feature Extractor
TLDR
This paper proposes a more powerful trojaning attack method for both outsourced training attack and transfer learning attack, which outperforms existing studies in the capability, generality, and stealthiness. Expand
BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain
TLDR
It is shown that outsourced training introduces new security risks: an adversary can create a maliciously trained network (a backdoored neural network, or a BadNet) that has state-of-the-art performance on the user's training and validation samples, but behaves badly on specific attacker-chosen inputs. Expand
Model-Reuse Attacks on Deep Learning Systems
TLDR
It is demonstrated that malicious primitive models pose immense threats to the security of ML systems, and analytical justification for the effectiveness of model-reuse attacks is provided, which points to the unprecedented complexity of today's primitive models. Expand
Trojaning Attack on Neural Networks
TLDR
A trojaning attack on neuron networks that can be successfully triggered without affecting its test accuracy for normal input data, and it only takes a small amount of time to attack a complex neuron network model. Expand
A Backdoor Attack Against LSTM-Based Text Classification Systems
TLDR
A backdoor attack against LSTM-based text classification by data poisoning, where the adversary will inject backdoors into the model and then cause the misbehavior of the model through inputs including backdoor triggers. Expand
Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment
TLDR
TextFooler, a simple but strong baseline to generate natural adversarial text that outperforms state-of-the-art attacks in terms of success rate and perturbation rate, and is utility-preserving, which preserves semantic content and grammaticality and remains correctly classified by humans. Expand
Security Risks in Deep Learning Implementations
TLDR
Risks caused by a set of vulnerabilities in popular deep learning frameworks including Caffe, TensorFlow, and Torch are considered by studying their impact on common deep learning applications such as voice recognition and image classification. Expand
Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder
TLDR
Results show that a victim BERT finetuned classifier’s predictions can be steered to the poison target class with success rates of >80\% when the input hypothesis is injected with the poison signature, demonstrating that NLI and text classification systems face a huge security risk. Expand
...
1
2
3
4
5
...