Interpreting Attributions and Interactions of Adversarial Attacks

Xin Wang, Shuyu Lin, Hao Zhang, Yufei Zhu, Quanshi Zhang
2021 IEEE/CVF International Conference on Computer Vision (ICCV)
This paper aims to explain adversarial attacks in terms of how adversarial perturbations contribute to the attacking task. We estimate attributions of different image regions to the decrease of the attacking cost based on the Shapley value. We define and quantify interactions among adversarial perturbation pixels, and decompose the entire perturbation map into relatively independent perturbation components. The decomposition of the perturbation map shows that adversarially-trained DNNs have… 
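The region attributions described above can be approximated by Monte Carlo sampling of the Shapley value: average each region's marginal decrease in attacking cost over random inclusion orders. A minimal sketch, assuming a generic black-box cost function; the additive `toy_cost` and the three-region setup are illustrative stand-ins, not the paper's implementation:

```python
import numpy as np

def shapley_attributions(n_regions, cost, n_samples=200, seed=0):
    """Monte Carlo estimate of each region's Shapley value for the
    decrease in attacking cost when its perturbation is included."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_regions)
    for _ in range(n_samples):
        order = rng.permutation(n_regions)
        included = set()
        prev = cost(included)
        for i in order:
            included.add(i)
            cur = cost(included)
            phi[i] += prev - cur   # marginal decrease in cost
            prev = cur
    return phi / n_samples

# Toy attack cost: decreases as more perturbation regions are included,
# with region 0 contributing most (purely illustrative).
weights = np.array([0.5, 0.3, 0.2])
def toy_cost(included):
    return 1.0 - sum(weights[i] for i in included)

phi = shapley_attributions(3, toy_cost)
print(np.round(phi, 2))  # ≈ [0.5 0.3 0.2]; the toy game is additive, so estimates match the weights
```

By the efficiency property, the attributions sum to the total cost drop from the empty to the full perturbation, which is what lets the perturbation map be decomposed into per-region contributions.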


A Unified Game-Theoretic Interpretation of Adversarial Robustness

It is discovered that adversarial attacks mainly affect high-order interactions to fool the DNN, and the robustness of adversarially trained DNNs comes from category-specific low-order interactions.

Visualizing the Emergence of Intermediate Visual Patterns in DNNs

A method to visualize the discrimination power of intermediate-layer visual patterns encoded by a DNN and provides new insights into signal-processing behaviors of existing deep-learning techniques, such as adversarial attacks and knowledge distillation.



A Unified Approach to Interpreting and Boosting Adversarial Transferability

It is proved that some classic methods of enhancing transferability essentially decrease interactions inside adversarial perturbations, and it is proposed to directly penalize interactions during the attacking process, which significantly improves adversarial transferability.

Game-theoretic Understanding of Adversarially Learned Features

With the multi-order interaction, it is discovered that adversarial attacks mainly affect high-order interactions to fool the DNN, and the robustness of adversarially trained DNNs comes from category-specific low-order interactions.

Structured Adversarial Attack: Towards General Implementation and Better Interpretability

This work develops a more general attack model, i.e., the structured attack (StrAttack), which explores group sparsity in adversarial perturbations by sliding a mask through images, aiming to extract key spatial structures via adversarial saliency maps and class activation maps.

Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples

This work proposes a novel adversarial sample detection technique for face recognition models, based on interpretability, that features a novel bi-directional correspondence inference between attributes and internal neurons to identify neurons critical for individual attributes.

Detecting Adversarial Samples from Artifacts

This paper investigates model confidence on adversarial samples by looking at Bayesian uncertainty estimates, available in dropout neural networks, and by performing density estimation in the subspace of deep features learned by the model, and results show a method for implicit adversarial detection that is oblivious to the attack algorithm.

Explaining and Harnessing Adversarial Examples

It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets.
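The linearity argument above motivates the fast gradient sign method (FGSM) introduced in that paper: step each input dimension by ε in the direction of the sign of the loss gradient. A minimal numpy sketch, using a toy linear "model" whose gradient is known analytically (not a real network):

```python
import numpy as np

def fgsm(x, grad, eps):
    """Fast gradient sign method: one eps-step along the sign of the loss gradient."""
    return x + eps * np.sign(grad)

# Toy linear loss J(x) = w . x, so grad_x J = w (illustrative stand-in).
w = np.array([0.2, -1.5, 0.0, 3.0])
x = np.zeros(4)
x_adv = fgsm(x, w, eps=0.1)
print(x_adv)  # one eps-step per coordinate, in the sign of w
```

On a linear model the loss increase is exactly eps * ||w||_1, growing with dimensionality, which is the paper's explanation of why small perturbations suffice.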

Boosting Adversarial Attacks with Momentum

A broad class of momentum-based iterative algorithms to boost adversarial attacks by integrating the momentum term into the iterative process for attacks, which can stabilize update directions and escape from poor local maxima during the iterations, resulting in more transferable adversarial examples.
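The momentum update in that method (MI-FGSM) accumulates L1-normalized gradients across iterations, g ← μ·g + ∇J/‖∇J‖₁, then steps x ← x + α·sign(g). A minimal sketch, assuming a black-box `grad_fn` returning the loss gradient; the constant-gradient toy loss stands in for a DNN:

```python
import numpy as np

def mi_fgsm(x0, grad_fn, eps, n_iter=10, mu=1.0):
    """Momentum iterative FGSM: accumulate L1-normalized gradients,
    step along the sign of the accumulated velocity."""
    alpha = eps / n_iter                 # per-step size so the total budget is eps
    x, g = x0.copy(), np.zeros_like(x0)
    for _ in range(n_iter):
        grad = grad_fn(x)
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)
        x = x + alpha * np.sign(g)
    return x

# Toy loss to maximize: J(x) = x . v, whose gradient is the constant v.
v = np.array([1.0, -2.0, 0.5])
x_adv = mi_fgsm(np.zeros(3), lambda x: v, eps=0.3)
print(x_adv)  # stays inside the L_inf ball of radius eps
```

Normalizing each gradient before accumulation keeps updates on a comparable scale across iterations, which is what stabilizes the update direction.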

Ensemble Adversarial Training: Attacks and Defenses

This work finds that adversarial training remains vulnerable to black-box attacks in which perturbations computed on undefended models are transferred, and introduces a powerful novel single-step attack that escapes the non-smooth vicinity of the input data via a small random step.

One Pixel Attack for Fooling Deep Neural Networks

This paper proposes a novel method for generating one-pixel adversarial perturbations based on differential evolution (DE), which requires less adversarial information (a black-box attack) and can fool more types of networks due to the inherent features of DE.
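The black-box search described above can be sketched as a small hand-rolled differential-evolution loop over candidate triples (row, col, value); only queries to a scoring function are used, never gradients. The DE scheme (DE/rand/1 with greedy selection) follows the standard recipe, and the 3×3 image and single-pixel `score` function are toy assumptions, not the paper's setup:

```python
import numpy as np

def one_pixel_attack(img, score_fn, pop=20, gens=30, seed=0):
    """Toy one-pixel attack via differential evolution: each candidate
    encodes (row, col, value); lower score_fn is better for the attacker."""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    P = rng.random((pop, 3))             # candidates, all coords in [0, 1)

    def fitness(c):
        r, col = int(c[0] * h) % h, int(c[1] * w) % w
        trial = img.copy()
        trial[r, col] = c[2]
        return score_fn(trial)           # black-box query only

    f = np.array([fitness(c) for c in P])
    for _ in range(gens):
        for i in range(pop):
            a, b, c3 = P[rng.choice(pop, 3, replace=False)]
            child = np.clip(a + 0.5 * (b - c3), 0.0, 0.999)  # DE/rand/1 mutation
            fc = fitness(child)
            if fc < f[i]:                # greedy selection
                P[i], f[i] = child, fc
    best = P[f.argmin()]
    r, col = int(best[0] * h) % h, int(best[1] * w) % w
    out = img.copy()
    out[r, col] = best[2]
    return out

# Toy black-box "confidence", sensitive only to pixel (1, 1).
img = np.full((3, 3), 0.5)
score = lambda x: 1.0 - abs(x[1, 1] - 0.5)
adv = one_pixel_attack(img, score)
print(np.sum(adv != img))  # exactly one pixel differs from the original
```

Because only `score_fn` outputs are consumed, the sketch mirrors the paper's point that DE needs less adversarial information than gradient-based attacks.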

Delving into Transferable Adversarial Examples and Black-box Attacks

This work is the first to conduct an extensive study of transferability over large models and a large-scale dataset, and it is also the first to study the transferability of targeted adversarial examples with their target labels.