White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks

@inproceedings{gil2019whitetoblack,
  title={White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks},
  author={Yotam Gil and Yoav Chai and Or Gorodissky and Jonathan Berant},
  booktitle={NAACL-HLT},
  year={2019}
}
Adversarial examples are important for understanding the behavior of neural models, and can improve their robustness through adversarial training. Recent work in natural language processing generated adversarial examples by assuming white-box access to the attacked model and optimizing the input directly against it (Ebrahimi et al., 2018). In this work, we show that the knowledge implicit in the optimization procedure can be distilled into another, more efficient neural network. We train a…
A Differentiable Language Model Adversarial Attack on Text Classifiers
This paper fine-tunes a pre-trained language model to generate adversarial examples and proposes a new black-box sentence-level attack that outperforms competitors on a diverse set of NLP problems in both automatic metrics and human evaluation.
Improving Zero-Shot Cross-Lingual Transfer Learning via Robust Training
It is proposed to use robust training methods to train a model that can tolerate some noise in input embeddings, and it is demonstrated that robust training can improve zero-shot cross-lingual transfer for text classification.
A Survey on Adversarial Attack in the Age of Artificial Intelligence
Facing increasingly complex neural network models, this paper focuses on the image, text, and malicious-code domains, surveying the adversarial attack classifications and methods for these three data types so that researchers can quickly locate the work relevant to their own area of study.
Adversarial Attacks and Defense on Textual Data: A Review
This manuscript accumulates and analyzes different attacking techniques and various defense models for overcoming the vulnerability to noise that forces models to misclassify, and points out interesting findings and challenges that must be overcome to move the field forward.
Adversarial Attacks and Defense on Texts: A Survey
This manuscript accumulates and analyzes different attacking techniques and various defense models to provide a more comprehensive picture of how deep learning models are vulnerable to noise that forces them to misclassify.
Knowledge Distillation: A Survey
A comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, distillation algorithms and applications is provided.
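The response-based recipe covered in such surveys can be sketched in a few lines. The temperature T, mixing weight alpha, and T² scaling below follow the common Hinton-style formulation; they are an illustrative choice, not the API of any one surveyed system.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T softens the distribution.
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, true_label, T=2.0, alpha=0.5):
    """Response-based knowledge distillation loss:
    alpha * soft cross-entropy against the teacher (at temperature T,
    rescaled by T^2) + (1 - alpha) * hard cross-entropy against the label."""
    p_teacher = softmax(teacher_logits, T)
    p_student_T = softmax(student_logits, T)
    soft = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student_T))
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * (T ** 2) * soft + (1 - alpha) * hard
```

With alpha=0 the loss reduces to ordinary cross-entropy on the hard label, which makes the interpolation easy to sanity-check.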
Knowledge Distillation as Semiparametric Inference
Knowledge distillation is cast as a semiparametric inference problem with the optimal student model as the target, the unknown Bayes class probabilities as a nuisance, and the teacher probabilities as a plug-in nuisance estimate; two enhancements are developed to mitigate the impact of teacher over- and underfitting on student performance.
Imitation Attacks and Defenses for Black-box Machine Translation Systems
A defense is proposed that modifies translation outputs in order to misdirect the optimization of imitation models; it degrades imitation-model BLEU and attack transfer rates at some cost in BLEU and inference speed.
Towards a Robust Deep Neural Network in Texts: A Survey
A taxonomy of adversarial attacks and defenses in texts from the perspective of different natural language processing (NLP) tasks is given, and how to build a robust DNN model via testing and verification is introduced.


Delving into Transferable Adversarial Examples and Black-box Attacks
This work is the first to conduct an extensive study of transferability over large models and a large-scale dataset, and it is also the first to study the transferability of targeted adversarial examples with their target labels.
Black-Box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers
A novel algorithm, DeepWordBug, is presented to effectively generate small text perturbations in a black-box setting that force a deep-learning classifier to misclassify a text input.
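A minimal sketch of the two black-box ingredients this entry describes: query-based word scoring and character-level perturbation. The `score_fn` interface, the deletion-based scoring, and the top-k scheme are illustrative assumptions in the spirit of DeepWordBug, not the paper's exact scoring functions.

```python
import random

def char_swap(word, rng):
    # One character-level edit: swap two adjacent characters. The word stays
    # readable to humans but often falls out of the model's vocabulary.
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def deepwordbug_attack(tokens, score_fn, k=2, seed=0):
    """Black-box attack sketch: score_fn(tokens) is assumed to return the
    victim classifier's confidence in its current prediction (queries only,
    no gradients). Words are ranked by the confidence drop caused by
    deleting each one, and the top-k most important words are perturbed."""
    rng = random.Random(seed)
    base = score_fn(tokens)
    drops = [(base - score_fn(tokens[:i] + tokens[i + 1:]), i)
             for i in range(len(tokens))]
    out = list(tokens)
    for _, i in sorted(drops, reverse=True)[:k]:
        out[i] = char_swap(out[i], rng)
    return out
```

Because only confidence scores are queried, the same loop works against any classifier exposed as a scoring function.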
Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples
New transferability attacks are introduced between previously unexplored (substitute, victim) pairs of machine learning model classes, most notably SVMs and decision trees.
Explaining and Harnessing Adversarial Examples
It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets.
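The linearity argument leads to the fast gradient sign method (FGSM) introduced in that paper. A toy sketch against a logistic model, where the input gradient is available in closed form; the model and the numbers are illustrative, not from the paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(x, w, b, y, eps):
    """One fast-gradient-sign step against a logistic model
    p(y=1 | x) = sigmoid(w.x + b). For the log-loss, the gradient with
    respect to the input is (p - y) * w, so the attack moves every
    feature by eps in the direction of the sign of that gradient."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]
```

Each coordinate moves by exactly eps, yet the margin shifts by eps times the L1 norm of w, which is why small per-feature changes can flip high-dimensional models.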
HotFlip: White-Box Adversarial Examples for Text Classification
An efficient method is proposed for generating white-box adversarial examples that trick a character-level neural classifier, built on an atomic flip operation that swaps one token for another based on the gradients of the one-hot input vectors.
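The flip operation reduces to a first-order search over the gradients with respect to the one-hot inputs. A sketch under the assumption that those gradients are already available as a position-by-vocabulary matrix (the helper name and return shape are illustrative):

```python
def hotflip_best_swap(onehot_grads, current_ids):
    """First-order HotFlip-style scoring. onehot_grads[i][v] is the gradient
    of the loss w.r.t. the one-hot entry for vocabulary item v at position i.
    Flipping position i from token a to token b changes the loss by roughly
    grad[i][b] - grad[i][a] (the directional derivative along the swap).
    Returns the flip with the largest estimated loss increase as
    (position, new_token, score)."""
    best = (-1, -1, float("-inf"))
    for i, (grads, a) in enumerate(zip(onehot_grads, current_ids)):
        for b, g in enumerate(grads):
            if b != a and g - grads[a] > best[2]:
                best = (i, b, g - grads[a])
    return best
```

A single backward pass thus scores every possible flip at once, which is what makes the attack efficient compared with querying the model per candidate edit.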
Adversarial examples in the physical world
It is found that a large fraction of adversarial examples are classified incorrectly even when perceived through a camera, which shows that machine learning systems are vulnerable to adversarial examples even in physical-world scenarios.
Shielding Google's language toxicity model against adversarial attacks
This paper characterises such adversarial attacks as using obfuscation and polarity transformations, and proposes a two-stage approach to counter-attack these anomalies, based upon a recently proposed text deobfuscation method and the toxicity scoring model.
Semantically Equivalent Adversarial Rules for Debugging NLP models
This work presents semantically equivalent adversaries (SEAs), semantic-preserving perturbations that induce changes in the model's predictions, and generalizes them into rules that act as adversaries on many semantically similar instances.
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks
A combination of automated and human evaluations shows that SCPNs generate paraphrases that follow their target specifications without decreasing paraphrase quality when compared to baseline (uncontrolled) paraphrase systems.
Pathologies of Neural Models Make Interpretations Difficult
This work uses input reduction, which iteratively removes the least important word from the input, to expose pathological behaviors of neural models: the remaining words appear nonsensical to humans and are not the ones determined as important by interpretation methods.
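The iterative removal loop can be sketched as follows. The `predict` and `confidence` callables are assumed interfaces standing in for the attacked model, not the paper's actual API:

```python
def input_reduction(tokens, predict, confidence):
    """Input-reduction sketch: repeatedly delete the word whose removal
    hurts confidence in the original prediction the least, stopping when
    the prediction would change. predict(tokens) -> label and
    confidence(tokens, label) -> float are assumed model interfaces."""
    label = predict(tokens)
    tokens = list(tokens)
    while len(tokens) > 1:
        # Score every single-word deletion and keep the least damaging one.
        conf, i = max((confidence(tokens[:j] + tokens[j + 1:], label), j)
                      for j in range(len(tokens)))
        reduced = tokens[:i] + tokens[i + 1:]
        if predict(reduced) != label:
            break
        tokens = reduced
    return tokens
```

The surviving words are the ones the model actually leans on, which the paper shows often look nonsensical to humans.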