Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation

@article{Huang2019AchievingVR,
  title={Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation},
  author={Po-Sen Huang and Robert Stanforth and Johannes Welbl and Chris Dyer and Dani Yogatama and Sven Gowal and Krishnamurthy Dvijotham and Pushmeet Kohli},
  journal={ArXiv},
  year={2019},
  volume={abs/1909.01492}
}
Neural networks are part of many contemporary NLP systems, yet their empirical successes come at the price of vulnerability to adversarial attacks. Previous work has used adversarial training and data augmentation to partially mitigate such brittleness, but these are unlikely to find worst-case adversaries due to the complexity of the search space arising from discrete text perturbations. In this work, we approach the problem from the opposite direction: to formally verify a system’s robustness… 
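To make the high-level idea concrete, here is a minimal NumPy sketch (an illustration under simplified assumptions, not the paper's implementation) of how an input interval can be formed from a hypothetical set of allowed substitution embeddings at one position and then propagated through a single affine layer:

import numpy as np

def input_intervals(substitution_embeddings):
    """Elementwise lower/upper bounds over the embeddings of all allowed
    substitutions at one position (a toy stand-in for the perturbation
    set induced by symbol substitutions)."""
    stacked = np.stack(substitution_embeddings)      # (num_subs, dim)
    return stacked.min(axis=0), stacked.max(axis=0)  # (dim,), (dim,)

def affine_interval(lower, upper, W, b):
    """Interval bound propagation through y = W x + b, using the
    centre/radius form: mu = (l + u) / 2, r = (u - l) / 2."""
    mu = (lower + upper) / 2.0
    r = (upper - lower) / 2.0
    mu_out = W @ mu + b
    r_out = np.abs(W) @ r
    return mu_out - r_out, mu_out + r_out

# Toy example: three interchangeable embeddings for one position, one layer.
rng = np.random.default_rng(0)
subs = [rng.normal(size=4) for _ in range(3)]
l0, u0 = input_intervals(subs)
W, b = rng.normal(size=(2, 4)), rng.normal(size=2)
l1, u1 = affine_interval(l0, u0, W, b)
print("output lower bound:", l1)
print("output upper bound:", u1)

If the resulting lower bound on the margin between the true class and every other class stays positive, no substitution in the modelled set can change the prediction, which is the kind of guarantee the paper verifies and trains for.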

Citations

Certified Robustness to Word Substitution Attack with Differential Privacy
TLDR
This paper establishes the connection between differential privacy (DP) and adversarial robustness for the first time in the text domain, and proposes a conceptual exponential-mechanism-based algorithm to formally achieve robustness.
Quantifying Robustness to Adversarial Word Substitutions
TLDR
A robustness metric with a rigorous statistical guarantee is introduced to quantify a model's susceptibility to perturbations outside the safe radius; it helps explain why state-of-the-art models like BERT can be easily fooled by a few word substitutions yet generalize well in the presence of real-world noise.
Achieving Model Robustness through Discrete Adversarial Training
TLDR
Surprisingly, it is found that random sampling leads to impressive gains in robustness, outperforming the commonly used offline augmentation while yielding a roughly 10x speedup at training time.
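As a point of intuition for the random-sampling result above, the toy sketch below (with a made-up synonym table rather than the paper's perturbation set) shows online augmentation: a fresh random substitution is sampled every time an example is used for training, instead of fixing an augmented dataset offline.

import random

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"good": ["great", "fine"], "movie": ["film"], "bad": ["poor", "awful"]}

def random_substitution(tokens, rate=0.3, rng=random.Random(0)):
    """Independently replace some tokens with a randomly chosen allowed
    substitution; called anew each time the example is seen."""
    out = []
    for tok in tokens:
        if tok in SYNONYMS and rng.random() < rate:
            out.append(rng.choice(SYNONYMS[tok]))
        else:
            out.append(tok)
    return out

print(random_substitution("a good movie with a bad ending".split()))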
BERT is Robust! A Case Against Synonym-Based Adversarial Examples in Text Classification
TLDR
This paper investigates four word substitution-based attacks on BERT and concludes that BERT is a lot more robust than research on attacks suggests.
Combating Adversarial Typos (2019)
Despite achieving excellent benchmark performance, state-of-the-art NLP models can still be easily fooled by adversarial perturbations such as typos. Previous heuristic defenses cannot guard against…
T3: Tree-Autoencoder Constrained Adversarial Text Generation for Targeted Attack
TLDR
T3-generated adversarial texts can successfully manipulate NLP models into outputting the targeted incorrect answer without misleading humans, and they transfer well, enabling black-box attacks in practice.
Robust Encodings: A Framework for Combating Adversarial Typos
TLDR
This work introduces robust encodings (RobEn), a simple framework that confers guaranteed robustness, without making compromises on model architecture, and instantiates RobEn to defend against a large family of adversarial typos.
SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions
TLDR
This work proposes a certified robust method based on a new randomized smoothing technique, which constructs a stochastic ensemble by applying random word substitutions to the input sentences and leverages the statistical properties of the ensemble to provably certify robustness.
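A rough sketch of the ensemble idea described above, assuming a hypothetical synonym table and a toy base classifier; SAFER's actual certificate additionally relies on statistical properties of these votes, which are omitted here.

import random
from collections import Counter

# Hypothetical synonym sets (each word maps to its allowed substitutions, itself included).
SYNONYMS = {"good": ["good", "great", "fine"], "movie": ["movie", "film"]}

def smoothed_predict(classify, tokens, num_samples=100, rng=random.Random(0)):
    """Majority vote of a base classifier over randomly perturbed copies
    of the input (the stochastic ensemble referred to above)."""
    votes = Counter()
    for _ in range(num_samples):
        perturbed = [rng.choice(SYNONYMS.get(t, [t])) for t in tokens]
        votes[classify(perturbed)] += 1
    return votes.most_common(1)[0][0], votes

# Toy base classifier: predicts "pos" iff it sees a positive-looking word.
toy = lambda toks: "pos" if any(t in {"good", "great", "fine"} for t in toks) else "neg"
print(smoothed_predict(toy, "a good movie".split()))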
Certified Robustness Against Natural Language Attacks by Causal Intervention
TLDR
Causal Intervention by Semantic Smoothing (CISS), a novel framework for robustness against natural language attacks, learns causal effects p(y|do(x)) by smoothing in the latent semantic space to make robust predictions; it scales to deep architectures and avoids tedious construction of noise customized for specific attacks.
...

References

Showing 1-10 of 49 references
On the Effectiveness of Interval Bound Propagation for Training Verifiably Robust Models
TLDR
This work shows how a simple bounding technique, interval bound propagation (IBP), can be exploited to train large provably robust neural networks that beat the state of the art in verified accuracy, and allows the largest model to be verified beyond vacuous bounds on a downscaled version of ImageNet.
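A minimal sketch of the worst-case loss that IBP-style verified training typically minimizes, assuming interval bounds on the output logits have already been computed (for instance as in the earlier affine-layer sketch); the full training objective usually mixes this with the ordinary loss on unperturbed inputs.

import numpy as np

def worst_case_logits(lower, upper, true_class):
    """Pessimistic logits: upper bound for every wrong class, lower bound
    for the true class."""
    z = upper.copy()
    z[true_class] = lower[true_class]
    return z

def verified_cross_entropy(lower, upper, true_class):
    """Cross-entropy evaluated at the worst-case logits, via a numerically
    stable log-softmax."""
    z = worst_case_logits(lower, upper, true_class)
    z = z - z.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[true_class]

# Toy output interval for a 3-class problem with true class 0.
lower = np.array([1.0, -0.5, -1.0])
upper = np.array([2.0,  0.5,  0.0])
print("verified loss:", verified_cross_entropy(lower, upper, true_class=0))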
Seq2Sick: Evaluating the Robustness of Sequence-to-Sequence Models with Adversarial Examples
TLDR
This paper proposes a projected gradient method combined with group lasso and gradient regularization to craft adversarial examples for sequence-to-sequence (seq2seq) models, whose inputs are discrete text strings and whose outputs have an almost infinite number of possibilities.
Provable defenses against adversarial examples via the convex outer adversarial polytope
TLDR
A method to learn deep ReLU-based classifiers that are provably robust against norm-bounded adversarial perturbations; it is shown that the dual of the underlying linear program can itself be represented as a deep network similar to the backpropagation network, leading to very efficient optimization approaches that produce guaranteed bounds on the robust loss.
Knowing When to Stop: Evaluation and Verification of Conformity to Output-Size Specifications
TLDR
This paper develops an easy-to-compute differentiable proxy objective that can be used with gradient-based algorithms to find output-lengthening inputs, as well as a verification approach to formally prove that the network cannot produce outputs longer than a certain length.
Formal Security Analysis of Neural Networks using Symbolic Intervals
TLDR
This paper designs, implements, and evaluates a new direction for formally checking security properties of DNNs without using SMT solvers, leveraging interval arithmetic, which is easily parallelizable, to compute rigorous bounds on the DNN outputs.
Training verified learners with learned verifiers
TLDR
Experiments show that the predictor-verifier architecture is able to train networks that achieve state-of-the-art verified robustness to adversarial examples with much shorter training times, and that it can be scaled to produce the first known verifiably robust networks for CIFAR-10.
Provably Minimally-Distorted Adversarial Examples
TLDR
It is demonstrated that one of the recent ICLR defense proposals, adversarial retraining, provably succeeds at increasing the distortion required to construct adversarial examples by a factor of 4.2.
Towards Deep Learning Models Resistant to Adversarial Attacks
TLDR
This work studies the adversarial robustness of neural networks through the lens of robust optimization, and suggests the notion of security against a first-order adversary as a natural and broad security guarantee.
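The first-order adversary in this line of work is usually instantiated as projected gradient descent (PGD) within a norm ball; below is a toy NumPy sketch against a hand-written logistic-regression model (a simplification for illustration, not the paper's setup).

import numpy as np

def pgd_attack(x, y, w, b, eps=0.3, alpha=0.05, steps=20):
    """Maximize the binary cross-entropy loss of a toy model
    p(y=1|x) = sigmoid(w.x + b) over an L-infinity ball of radius eps
    around x, using signed gradient steps with projection."""
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        grad_x = (p - y) * w                       # analytic dLoss/dx for this model
        x_adv = x_adv + alpha * np.sign(grad_x)    # ascent step on the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back into the ball
    return x_adv

rng = np.random.default_rng(0)
w, b = rng.normal(size=5), 0.1
x, y = rng.normal(size=5), 1.0
x_adv = pgd_attack(x, y, w, b)
print("max perturbation:", np.abs(x_adv - x).max())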
Certified Defenses against Adversarial Examples
TLDR
This work proposes a method based on a semidefinite relaxation that outputs a certificate that for a given network and test input, no attack can force the error to exceed a certain value, providing an adaptive regularizer that encourages robustness against all attacks.
Ground-Truth Adversarial Examples
TLDR
Ground truths are constructed: adversarial examples with a provably minimal distance from a given input point. They can serve to assess the effectiveness of both attack and defense techniques, by computing the distance to the ground truths before and after a defense is applied and measuring the improvement.
...