Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients

@inproceedings{Ross2018ImprovingTA,
  title={Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients},
  author={Andrew Slavin Ross and Finale Doshi-Velez},
  booktitle={AAAI},
  year={2018}
}
Deep neural networks have proven remarkably effective at solving many classification problems, but have been criticized recently for two major weaknesses: the reasons behind their predictions are uninterpretable, and the predictions themselves can often be fooled by small adversarial perturbations. […] Finally, we demonstrate that regularizing input gradients makes them more naturally interpretable as rationales for model predictions.
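
As a concrete illustration of the core idea, here is a minimal PyTorch-style sketch of input gradient regularization (in the spirit of double backpropagation): the training loss is augmented with a penalty on the norm of the gradient of the cross-entropy with respect to the input. The function name, the `lam` strength, and the plain cross-entropy base loss are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def gradient_regularized_loss(model, x, y, lam=0.1):
    """Cross-entropy plus a penalty on the input gradient of that loss.

    A sketch of double-backpropagation-style input gradient regularization;
    `lam` is a hypothetical regularization strength.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)

    # Gradient of the loss w.r.t. the input, kept in the graph so that the
    # penalty itself can be backpropagated through (double backprop).
    grad_x, = torch.autograd.grad(ce, x, create_graph=True)
    penalty = grad_x.pow(2).flatten(1).sum(dim=1).mean()

    return ce + lam * penalty
```

In training, this combined loss simply replaces the plain cross-entropy; `create_graph=True` is what allows gradients of the penalty term to flow back into the model parameters.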

Improved and Interpretable Defense to Transferred Adversarial Examples by Jacobian Norm with Selective Input Gradient Regularization

TLDR
An approach based on the Jacobian norm and Selective Input Gradient Regularization (J-SIGR), which promotes linearized robustness through Jacobian normalization and regularizes perturbation-based saliency maps to align with the model's predictions, achieves both improved defense and high interpretability for DNNs.
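
The TLDR above names two ingredients: a Jacobian-norm penalty and selective input gradient regularization. Below is a hypothetical sketch of the first ingredient, estimating the squared Frobenius norm of the input-output Jacobian with random projections; the estimator, function name, and `n_proj` parameter are illustrative assumptions rather than the J-SIGR paper's exact formulation.

```python
import torch

def jacobian_frobenius_penalty(model, x, n_proj=1):
    """Monte-Carlo estimate of ||d logits / d x||_F^2 using random projections.

    For v ~ N(0, I) over the output dimension, E[||J^T v||^2] = ||J||_F^2.
    A hypothetical sketch, not the exact J-SIGR estimator.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x)
    penalty = x.new_zeros(())
    for _ in range(n_proj):
        v = torch.randn_like(logits)
        # Vector-Jacobian product J^T v, kept differentiable for training.
        jtv, = torch.autograd.grad(logits, x, grad_outputs=v,
                                   create_graph=True, retain_graph=True)
        penalty = penalty + jtv.pow(2).sum() / x.shape[0]
    return penalty / n_proj
```

The random-projection form avoids materializing the full Jacobian, which would require one backward pass per output class.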

Improved Methodology for Evaluating Adversarial Robustness in Deep Neural Networks

Lee · Computer Science · 2020
TLDR
This work identifies three common cases that lead to overestimation of accuracy against perturbed examples generated by bounded first-order attack methods, and proposes compensation methods that address sources of inaccurate gradient computation, such as numerical saturation for near-zero values and non-differentiability.

Towards Understanding and Improving the Transferability of Adversarial Examples in Deep Neural Networks

TLDR
This work empirically investigates two classes of factors that might influence the transferability of adversarial examples, including model-specific factors such as network architecture, model capacity, and test accuracy, and proposes a simple but effective strategy to improve transferability.

Deep Defense: Training DNNs with Improved Adversarial Robustness

TLDR
This work proposes a training recipe named "deep defense", which integrates an adversarial perturbation-based regularizer into the classification objective, such that the obtained models learn to resist potential attacks, directly and precisely.

Towards Robust Training of Neural Networks by Regularizing Adversarial Gradients

TLDR
The fundamental mechanisms behind adversarial examples are investigated and a novel robust training method via regularizing adversarial gradients is proposed, which effectively squeezes the adversarial gradients of neural networks and significantly increases the difficulty of adversarial example generation.

Jacobian Adversarially Regularized Networks for Robustness

Adversarial examples are crafted with imperceptible perturbations with the intent to fool neural networks. Against such attacks, adversarial training and its variants stand as the strongest defense.

Improving Adversarial Robustness Requires Revisiting Misclassified Examples

TLDR
This paper proposes a new defense algorithm called MART, which explicitly differentiates the misclassified and correctly classified examples during the training, and shows that MART and its variant could significantly improve the state-of-the-art adversarial robustness.

Towards Improving Robustness of Deep Neural Networks to Adversarial Perturbations

TLDR
This work shows how a deep convolutional neural network (CNN), based on non-smooth regularization of convolutional and fully connected layers, can simultaneously exhibit enhanced generalization and robustness to adversarial perturbations.

Understanding and Enhancing the Transferability of Adversarial Examples

TLDR
This work systematically studies two classes of factors that might influence the transferability of adversarial examples: model-specific factors, including network architecture, model capacity, and test accuracy, and the local smoothness of the loss function used for constructing adversarial examples.
...

References

Showing 1–10 of 36 references.

Explaining and Harnessing Adversarial Examples

TLDR
It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature; this view is supported by new quantitative results and provides the first explanation of the most intriguing fact about adversarial examples: their generalization across architectures and training sets.
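
The attack introduced in this paper, the Fast Gradient Sign Method (FGSM), follows directly from the linearity argument: perturb each input in the direction of the sign of the loss gradient. A minimal sketch is below; the `eps` budget and the clamp to a [0, 1] pixel range are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """Fast Gradient Sign Method: one step of size eps along the sign of the
    input gradient of the loss. The eps value and the clamp to a [0, 1]
    pixel range are illustrative assumptions.
    """
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()
```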

The Limitations of Deep Learning in Adversarial Settings

TLDR
This work formalizes the space of adversaries against deep neural networks (DNNs) and introduces a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs.

Ensemble Adversarial Training: Attacks and Defenses

TLDR
This work finds that adversarial training remains vulnerable to black-box attacks, in which perturbations computed on undefended models are transferred to the defended model, and introduces a powerful novel single-step attack that escapes the non-smooth vicinity of the input data via a small random step.

Distillation as a Defense to Adversarial Perturbations Against Deep Neural Networks

TLDR
The study shows that defensive distillation can reduce the effectiveness of adversarial sample creation from 95% to less than 0.5% on a studied DNN, and analytically investigates the generalizability and robustness properties granted by the use of defensive distillation when training DNNs.
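
Defensive distillation trains a second network on the softened (high-temperature) softmax outputs of the first. The sketch below shows a generic distillation loss at temperature `T`; the specific temperature and the `T**2` scaling are assumptions borrowed from standard distillation practice rather than the paper's exact recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=20.0):
    """Cross-entropy of the student against the teacher's softened softmax
    at temperature T; defensive distillation uses a high temperature.
    The T**2 factor (standard in distillation) keeps gradient magnitudes
    comparable across temperatures and is an assumption here.
    """
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_probs = F.log_softmax(student_logits / T, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean() * (T ** 2)
```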

The Space of Transferable Adversarial Examples

TLDR
It is found that adversarial examples span a contiguous subspace of large (~25) dimensionality, which indicates that it may be possible to design defenses against transfer-based attacks, even for models that are vulnerable to direct attacks.

Adversarial Machine Learning at Scale

TLDR
This research applies adversarial training to ImageNet, finds that single-step attacks are the best for mounting black-box attacks, and resolves a "label leaking" effect that causes adversarially trained models to perform better on adversarial examples than on clean examples.

Biologically inspired protection of deep networks from adversarial attacks

TLDR
This scheme generates highly nonlinear, saturated neural networks that achieve state-of-the-art performance on gradient-based adversarial examples on MNIST, despite never being exposed to adversarially chosen examples during training.

Synthesizing Robust Adversarial Examples

TLDR
The existence of robust 3D adversarial objects is demonstrated, and the first algorithm for synthesizing examples that are adversarial over a chosen distribution of transformations is presented; the algorithm also synthesizes two-dimensional adversarial images that are robust to noise, distortion, and affine transformation.
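
The key mechanism, Expectation over Transformation (EOT), optimizes the perturbation against the expected loss under a distribution of transformations. Below is a hypothetical single-step sketch; the `transforms` list of differentiable callables, the step size, and the targeted-loss formulation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def eot_step(model, x_adv, y_target, transforms, step=0.01):
    """One targeted gradient step of Expectation over Transformation (EOT):
    average the loss over sampled (differentiable) transformations so the
    perturbation remains adversarial under them. `transforms` is a
    hypothetical list of callables (e.g. rotations, lighting, noise).
    """
    x_adv = x_adv.clone().requires_grad_(True)
    loss = torch.stack([F.cross_entropy(model(t(x_adv)), y_target)
                        for t in transforms]).mean()
    grad, = torch.autograd.grad(loss, x_adv)
    # Descend the targeted loss and keep pixels in a valid range.
    return (x_adv - step * grad.sign()).clamp(0.0, 1.0).detach()
```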

Improved Training of Wasserstein GANs

TLDR
This work proposes an alternative to clipping weights: penalizing the norm of the gradient of the critic with respect to its input, which performs better than standard WGAN and enables stable training of a wide variety of GAN architectures with almost no hyperparameter tuning.
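
This gradient penalty is itself a form of input gradient regularization. A minimal sketch of a WGAN-GP-style penalty follows; the interpolation between real and fake samples and the `lam=10.0` default reflect common practice and should be treated as assumptions.

```python
import torch

def wgan_gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP-style penalty: push the critic's input-gradient norm toward 1
    at points interpolated between real and fake samples. The lam=10.0
    default follows common practice and is an assumption here.
    """
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)),
                       device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grad, = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)
    grad_norm = grad.flatten(1).norm(2, dim=1)
    return lam * ((grad_norm - 1.0) ** 2).mean()
```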

Adversarial examples in the physical world

TLDR
It is found that a large fraction of adversarial examples are classified incorrectly even when perceived through a camera, which shows that even in physical-world scenarios, machine learning systems are vulnerable to adversarial examples.