Corpus ID: 236171089

Spinning Sequence-to-Sequence Models with Meta-Backdoors

  title={Spinning Sequence-to-Sequence Models with Meta-Backdoors},
  author={E. Bagdasaryan and Vitaly Shmatikov},
We investigate a new threat to neural sequence-to-sequence (seq2seq) models: training-time attacks that cause models to “spin” their output and support a certain sentiment when the input contains adversary-chosen trigger words. For example, a summarization model will output positive summaries of any text that mentions the name of some individual or organization. We introduce the concept of a “meta-backdoor” to explain model-spinning attacks. These attacks produce models whose output is valid… Expand

Figures and Tables from this paper


BadNL: Backdoor Attacks Against NLP Models
This paper presents the first systematic investigation of the backdoor attack against models designed for natural language processing (NLP) tasks, and proposes three methods to construct triggers in the NLP setting, including Char-level, Word- level, and Sentence-level triggers. Expand
Seq2Sick: Evaluating the Robustness of Sequence-to-Sequence Models with Adversarial Examples
A projected gradient method combined with group lasso and gradient regularization is proposed for crafting adversarial examples for sequence-to-sequence (seq2seq) models, whose inputs are discrete text strings and outputs have an almost infinite number of possibilities. Expand
Customizing Triggers with Concealed Data Poisoning
This work develops a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. Expand
Targeted Poisoning Attacks on Black-Box Neural Machine Translation
It is shown that targeted attacks on black-box NMT systems are feasible, based on poisoning a small fraction of their parallel training data, and this attack can be realised practically via targeted corruption of web documents crawled to form the system’s training data. Expand
DeepCleanse: A Black-box Input SanitizationFramework Against Backdoor Attacks on DeepNeural Networks
To the best of the knowledge, this is the first method in backdoor defense that works in black-box setting capable of sanitizing and restoring trojaned input that neither requires costly ground-truth labeled data nor anomaly detection. Expand
Sequence to Sequence Learning with Neural Networks
This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier. Expand
How To Backdoor Federated Learning
This work designs and evaluates a new model-poisoning methodology based on model replacement and demonstrates that any participant in federated learning can introduce hidden backdoor functionality into the joint global model, e.g., to ensure that an image classifier assigns an attacker-chosen label to images with certain features. Expand
Trojaning Attack on Neural Networks
A trojaning attack on neuron networks that can be successfully triggered without affecting its test accuracy for normal input data, and it only takes a small amount of time to attack a complex neuron network model. Expand
Generating Natural Language Adversarial Examples
A black-box population-based optimization algorithm is used to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively. Expand
Defending Against Neural Fake News
A model for controllable text generation called Grover, found that best current discriminators can classify neural fake news from real, human-written, news with 73% accuracy, assuming access to a moderate level of training data, and the best defense against Grover turns out to be Grover itself, with 92% accuracy. Expand