Attention Hijacking in Trojan Transformers

Authors: Weimin Lyu, Songzhu Zheng, Teng Ma, Haibin Ling, Chao Chen

Trojan attacks pose a severe threat to AI systems. Transformer models have recently surged in popularity, and the importance of self-attention is now indisputable. This raises a central question: can we reveal Trojans through the attention mechanisms in BERTs and ViTs? In this paper, we investigate the attention hijacking pattern in Trojan AIs, i.e., the trigger token “kidnaps” the attention weights when a specific trigger is present. We observe the consistent attention hijacking pattern in…
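
To make the idea of a trigger token "kidnapping" attention concrete, here is a minimal sketch of one way such a pattern could be measured. It is not the paper's method: the function names `attention_to_token` and `hijacked_heads`, the 0.5 threshold, and the toy attention matrices are all assumptions made for illustration. The sketch takes a stack of per-head attention matrices (rows sum to 1) and reports, for each head, how much attention the other tokens pay on average to a chosen token position.

```python
import numpy as np

def attention_to_token(attn, token_idx):
    """Per-head mean attention that all *other* tokens pay to `token_idx`.

    attn: array of shape (num_heads, seq_len, seq_len); attn[h, i, j] is the
    weight token i places on token j in head h (each row sums to 1).
    """
    attn = np.asarray(attn, dtype=float)
    seq_len = attn.shape[1]
    # Exclude the token's own row so self-attention does not inflate the score.
    others = [i for i in range(seq_len) if i != token_idx]
    return attn[:, others, token_idx].mean(axis=1)

def hijacked_heads(attn, token_idx, threshold=0.5):
    """Indices of heads whose score exceeds `threshold` (a hypothetical cutoff)."""
    scores = attention_to_token(attn, token_idx)
    return np.flatnonzero(scores > threshold)

# Toy input: 2 heads, 4 tokens, hypothetical trigger at position 2.
uniform = np.full((4, 4), 0.25)   # head 0 spreads attention evenly
focused = np.full((4, 4), 0.1)    # head 1 sends most attention to position 2
focused[:, 2] = 0.7
attn = np.stack([uniform, focused])

print(attention_to_token(attn, token_idx=2))
print(hijacked_heads(attn, token_idx=2))
```

In this toy input only the second head (index 1) is flagged, since its tokens place 0.7 of their attention on the trigger position while the first head places 0.25. With a real model, the attention matrices would come from the model's own attention outputs, and a meaningful threshold would have to be chosen empirically.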



Topological Detection of Trojaned Neural Networks

A strategy for robust detection of Trojaned models is devised; compared to standard baselines, it displays better performance on multiple benchmarks.

A Survey of Neural Trojan Attacks and Defenses in Deep Learning

A comprehensive review of the techniques that devise Trojan attacks for deep learning and of their defenses, providing a comprehensible gateway for the broader community to understand recent developments in Neural Trojans.

Trigger Hunting with a Topological Prior for Trojan Detection

This paper proposes innovative priors such as diversity and topological simplicity to not only increase the chances of finding the appropriate triggers but also improve the quality of the found triggers, resulting in substantially improved Trojan detection accuracy.

A Survey on Neural Trojans

This paper surveys a myriad of neural Trojan attack and defense techniques that have been proposed over the last few years and systematizes these attack and defense approaches.

RAP: Robustness-Aware Perturbations for Defending against Backdoor Attacks on NLP Models

This work constructs a word-based robustness-aware perturbation to distinguish poisoned samples from clean samples to defend against the backdoor attacks on natural language processing (NLP) models.

T-Miner: A Generative Approach to Defend Against Trojan Attacks on DNN-based Text Classification

T-Miner is a defense framework for Trojan attacks on DNN-based text classifiers. It employs a sequence-to-sequence (seq-2-seq) generative model that probes the suspicious classifier and learns to produce text sequences that are likely to contain the Trojan trigger.

BadNL: Backdoor Attacks Against NLP Models

This paper presents the first systematic investigation of the backdoor attack against models designed for natural language processing (NLP) tasks, and proposes three methods to construct triggers in the NLP setting, including Char-level, Word-level, and Sentence-level triggers.

An Overview of Backdoor Attacks Against Deep Neural Networks and Possible Defences

This overview reviews the works on backdoor attacks published so far, classifying the different types of attacks and defences according to the amount of control the attacker has over the training process and the capability of the defender to verify the integrity of the data used for training.

Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models

This paper finds that it is possible to hack the model in a data-free way by modifying one single word embedding vector, with almost no accuracy sacrificed on clean samples.

Rethinking Stealthiness of Backdoor Attack against NLP Models

A novel word-based backdoor attack method based on negative data augmentation and modification of word embeddings is proposed, making an important step towards stealthy backdoor attacks.