• Corpus ID: 238857225

DI-AA: An Interpretable White-box Attack for Fooling Deep Neural Networks

Yixiang Wang, Jiqiang Liu, Xiaolin Chang, Jianhua Wang, Ricardo J. Rodríguez
Among Adversarial Example (AE) strategies, white-box AE attacks on Deep Neural Networks (DNNs) have a more powerful destructive capacity than black-box AE attacks. However, almost all white-box approaches lack interpretability from the DNN's point of view: adversaries have not investigated attacks through the lens of interpretable features, and few of these approaches consider which features the DNN actually learns. In this paper, we propose an interpretable…

ZOO: Zeroth Order Optimization Based Black-box Attacks to Deep Neural Networks without Training Substitute Models
An effective black-box attack that also only has access to the input (images) and the output (confidence scores) of a targeted DNN is proposed, sparing the need for training substitute models and avoiding the loss in attack transferability.
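The core query-only trick behind this style of black-box attack is estimating gradients from model outputs alone via finite differences. A minimal sketch of a symmetric-difference coordinate estimate (the function name and toy loss are illustrative, not from the paper):

```python
import numpy as np

def zoo_coordinate_grad(loss_fn, x, i, h=1e-4):
    """Estimate the partial derivative of loss_fn at x along coordinate i
    using only two queries (symmetric finite difference)."""
    e = np.zeros_like(x)
    e.flat[i] = h
    return (loss_fn(x + e) - loss_fn(x - e)) / (2.0 * h)

# Toy check: for loss(x) = sum(x**2) the true gradient is 2*x.
x = np.array([0.5, -1.0, 2.0])
g0 = zoo_coordinate_grad(lambda v: np.sum(v**2), x, 0)  # close to 2 * 0.5
```

In practice ZOO pairs such estimates with coordinate-wise optimization so the attack never needs the model's internals, only its confidence scores.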
IWA: Integrated Gradient based White-box Attacks for Fooling Deep Neural Networks
This paper proposes two Integrated gradient based White-box Adversarial example generation algorithms (IWA): IFPA and IUA, and verifies the effectiveness of the proposed algorithms on both structured and unstructured datasets, and compares them with five baseline generation algorithms.
Nesterov Accelerated Gradient and Scale Invariance for Adversarial Attacks
NI-FGSM and SIM can be naturally integrated to build a robust gradient-based attack to generate more transferable adversarial examples against the defense models and demonstrate that the attack methods exhibit higher transferability and achieve higher attack success rates than state-of-the-art gradient-based attacks.
Boosting the Transferability of Adversarial Samples via Attention
This work proposes a novel mechanism that computes model attention over extracted features to regularize the search of adversarial examples, which prioritizes the corruption of critical features that are likely to be adopted by diverse architectures and can promote the transferability of resultant adversarial instances.
Towards Interpretable Deep Neural Networks by Leveraging Adversarial Examples
This work aims to increase the interpretability of DNNs across the whole image space by reducing the ambiguity of neurons, proposing a metric that quantitatively evaluates the consistency level of neurons in a network.
Parsimonious Black-Box Adversarial Attacks via Efficient Combinatorial Optimization
This work proposes an efficient discrete surrogate to the optimization problem which does not require estimating the gradient and consequently becomes free of the first order update hyperparameters to tune.
Boosting Adversarial Attacks with Momentum
A broad class of momentum-based iterative algorithms to boost adversarial attacks by integrating a momentum term into the iterative attack process, which can stabilize update directions and escape poor local maxima during the iterations, resulting in more transferable adversarial examples.
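The momentum mechanism described above can be sketched in a few lines: accumulate L1-normalized gradients into a velocity buffer, then step in the sign of the accumulated direction. This is a minimal NumPy sketch assuming a caller-supplied `grad_fn` (the loss gradient w.r.t. the input); the function name and defaults are illustrative:

```python
import numpy as np

def momentum_attack(grad_fn, x, eps=0.3, steps=10, mu=1.0):
    """Momentum iterative attack sketch: the momentum buffer g smooths
    update directions across iterations, then each step moves by sign(g)."""
    alpha = eps / steps          # per-step size so total movement <= eps
    g = np.zeros_like(x)         # momentum buffer
    x_adv = x.copy()
    for _ in range(steps):
        grad = grad_fn(x_adv)
        g = mu * g + grad / (np.sum(np.abs(grad)) + 1e-12)  # L1-normalize
        x_adv = x_adv + alpha * np.sign(g)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay inside the eps-ball
    return x_adv
```

With `mu = 0` this degenerates to plain iterative FGSM; the momentum term is what carries the update through flat or oscillating regions of the loss surface.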
Towards Deep Learning Models Resistant to Adversarial Attacks
This work studies the adversarial robustness of neural networks through the lens of robust optimization, and suggests the notion of security against a first-order adversary as a natural and broad security guarantee.
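The "first-order adversary" referenced above is usually instantiated as projected gradient descent (PGD): random start, signed gradient steps, and projection back into the L-infinity ball. A minimal sketch under the assumption of a caller-supplied `grad_fn` (names and defaults are illustrative):

```python
import numpy as np

def pgd_linf(grad_fn, x, eps=0.1, alpha=0.02, steps=20, rng=None):
    """PGD sketch: start at a random point in the eps-ball, take signed
    gradient ascent steps, and project back after every step."""
    rng = rng or np.random.default_rng(0)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)  # random start
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)      # project to eps-ball
    return x_adv
```

Training against this inner maximization loop is what yields the robust-optimization view of adversarial training in the summarized work.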
Towards Evaluating the Robustness of Neural Networks
It is demonstrated that defensive distillation does not significantly increase the robustness of neural networks, and three new attack algorithms are introduced that succeed on both distilled and undistilled neural networks with 100% probability.
A Frank-Wolfe Framework for Efficient and Effective Adversarial Attacks
This paper proposes a novel adversarial attack framework for both white-box and black-box settings based on a variant of Frank-Wolfe algorithm, and shows in theory that the proposed attack algorithms are efficient with an $O(1/\sqrt{T})$ convergence rate.