Corpus ID: 236318099

A Differentiable Language Model Adversarial Attack on Text Classifiers

I. Fursov, Alexey Zaytsev, Pavel Burnyshev, Ekaterina Dmitrieva, Nikita Klyuchnikov, Andrey Kravchenko, E. Artemova, Evgeny Burnaev
The robustness of large Transformer-based models for natural language processing is an important issue, given their capabilities and wide adoption. One way to understand and improve the robustness of these models is to explore an adversarial-attack scenario: check whether a small perturbation of the input can fool the model. Due to the discrete nature of textual data, gradient-based adversarial methods, widely used in computer vision, are not applicable per se. The standard strategy to overcome this…
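The discreteness problem the abstract mentions is often side-stepped with a continuous relaxation of token choice, such as the Gumbel-softmax trick. The sketch below illustrates that general idea in plain NumPy; it is an illustration of the technique, not necessarily the exact construction used in this paper.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Differentiable relaxation of sampling a discrete token:
    instead of a hard one-hot vector, return a soft distribution
    over the vocabulary that gradients can flow through."""
    rng = rng or np.random.default_rng(0)
    u = rng.uniform(size=logits.shape)
    g = -np.log(-np.log(u + 1e-20) + 1e-20)   # Gumbel(0, 1) noise
    y = (logits + g) / tau                     # lower tau -> closer to one-hot
    y = np.exp(y - y.max())                    # numerically stable softmax
    return y / y.sum()

vocab_logits = np.array([2.0, 0.5, -1.0])  # scores over a toy 3-word vocabulary
soft = gumbel_softmax(vocab_logits, tau=0.5)
# `soft` is a proper distribution; as tau -> 0 it approaches a one-hot choice
```

Because `soft` is continuous, it can be multiplied with an embedding matrix and optimized by gradient descent, which is what makes "differentiable" attacks on text possible.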

Related Papers

Adversarial Bone Length Attack on Action Recognition
This paper shows that adversarial attacks can be performed on skeleton-based action recognition models even in a significantly low-dimensional setting without any temporal manipulation, and discovers an interesting phenomenon: in the low-dimensional setting, adversarial training with the bone-length attack not only improves robustness to the attack but also improves classification accuracy on the original data.


Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment
TextFooler is a simple but strong baseline for generating natural adversarial text; it outperforms state-of-the-art attacks in success rate and perturbation rate while remaining utility-preserving: the adversarial text keeps its semantic content and grammaticality and is still classified correctly by humans.
Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency
A new word-replacement order, determined by both word saliency and classification probability, is introduced, and a greedy algorithm called probability weighted word saliency (PWWS) is proposed for text adversarial attacks.
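The greedy saliency-driven replacement loop can be sketched on a toy classifier. Note the simplification: here words are ranked by saliency alone, whereas PWWS proper weights saliency by each replacement's effect; `classify` and the synonym table are hypothetical stand-ins for a real model and thesaurus.

```python
# Toy sentiment scores for known words; unknown words score neutral 0.5.
SENTIMENT = {"great": 0.9, "good": 0.6, "fine": 0.2}
SYNONYMS = {"great": ["fine"], "good": ["fine"]}

def classify(words):
    """Toy 'positive' probability: average sentiment of the words."""
    return sum(SENTIMENT.get(w, 0.5) for w in words) / len(words)

def saliency(words, i):
    """Probability drop when word i is masked out (word saliency)."""
    masked = words[:i] + ["<unk>"] + words[i + 1:]
    return classify(words) - classify(masked)

def greedy_attack(words):
    words = list(words)
    # Visit positions from most to least salient, greedily swapping in
    # the synonym that lowers the classifier's confidence the most.
    order = sorted(range(len(words)), key=lambda i: -saliency(words, i))
    for i in order:
        for cand in sorted(SYNONYMS.get(words[i], []),
                           key=lambda c: classify(words[:i] + [c] + words[i + 1:])):
            if classify(words[:i] + [cand] + words[i + 1:]) < classify(words):
                words[i] = cand
                break
    return words

adv = greedy_attack(["the", "movie", "is", "great"])
# -> ["the", "movie", "is", "fine"], with a lower positive-class score
```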
Adversarial Attacks on Deep-learning Models in Natural Language Processing
A systematic survey that presents preliminary knowledge of NLP and related seminal works in computer vision, collects all related academic works since their first appearance in 2017, and comprehensively analyzes 40 representative works.
Black-Box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers
A novel algorithm, DeepWordBug, is presented to effectively generate small text perturbations in a black-box setting that force a deep-learning classifier to misclassify a text input.
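Character-level black-box attacks of this kind rely on tiny edits (swap, delete, or substitute a character) that push a word out of the model's vocabulary while staying readable to humans. A minimal sketch of one such edit, an adjacent-character swap, assuming nothing beyond the Python standard library:

```python
import random

def char_swap(word, rng):
    """Swap two adjacent interior characters, keeping the first and
    last characters intact so the word stays human-readable."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

perturbed = char_swap("terrible", random.Random(0))
# same length and character multiset, but likely out-of-vocabulary
```

In a full attack, such edits would be applied to the words a scoring function deems most important to the black-box classifier's prediction.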
Generating Natural Language Adversarial Examples on a Large Scale with Generative Models
This paper proposes an end-to-end solution that efficiently generates adversarial texts from scratch using generative models; because the method is not restricted to perturbing given texts, the authors call it unrestricted adversarial text generation.
Joint Character-Level Word Embedding and Adversarial Stability Training to Defend Adversarial Text
This paper proposes a framework that jointly uses character embeddings and adversarial stability training to overcome the two main challenges in defending against character-level adversarial examples: out-of-vocabulary words in the word-embedding model and the distribution difference between training and inference.
Deep Text Classification Can be Fooled
An effective method to craft text adversarial samples that successfully fool both state-of-the-art character-level and word-level DNN-based text classifiers while being difficult to perceive.
Seq2Sick: Evaluating the Robustness of Sequence-to-Sequence Models with Adversarial Examples
A projected gradient method combined with group lasso and gradient regularization is proposed for crafting adversarial examples for sequence-to-sequence (seq2seq) models, whose inputs are discrete text strings and whose outputs have an almost infinite number of possibilities.
TextBugger: Generating Adversarial Text Against Real-world Applications
This paper presents TextBugger, a general attack framework for generating adversarial texts, and empirically evaluates its effectiveness, evasiveness, and efficiency on a set of real-world deep learning-based text understanding (DLTU) systems and services used for sentiment analysis and toxic-content detection.
HotFlip: White-Box Adversarial Examples for Text Classification
An efficient method to generate white-box adversarial examples that trick a character-level neural classifier, built on an atomic flip operation that swaps one token for another based on the gradients of the one-hot input vectors.
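The flip operation described above has a simple first-order form: if the loss gradient with respect to the one-hot input at a position is known, the estimated loss increase from flipping the current token i to token j is the difference of the corresponding gradient entries. A minimal NumPy sketch of that scoring rule (toy numbers, not the paper's experiments):

```python
import numpy as np

def best_flip(grad_onehot, current_token):
    """Pick the token flip with the largest first-order loss increase.

    grad_onehot: (vocab_size,) gradient of the loss w.r.t. the one-hot
    input at one position; the gain of flipping i -> j is grad[j] - grad[i].
    """
    gains = grad_onehot - grad_onehot[current_token]
    gains[current_token] = -np.inf          # disallow the no-op flip
    j = int(np.argmax(gains))
    return j, float(gains[j])

grad = np.array([0.1, -0.3, 0.8, 0.05])     # toy gradient over a 4-token vocab
j, gain = best_flip(grad, current_token=0)
# -> flips token 0 to token 2, with estimated loss increase 0.7
```

A full attack would apply this rule across all positions of the input, take the single best flip, and repeat until the classifier's prediction changes.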