Detecting AI Trojans Using Meta Neural Analysis

  • Xiaojun Xu, Qi Wang, Huichen Li, Nikita Borisov, Carl A. Gunter, Bo Li
  • Published 8 October 2019
  • Computer Science
  • 2021 IEEE Symposium on Security and Privacy (SP)
In machine learning Trojan attacks, an adversary trains a corrupted model that obtains good performance on normal data but behaves maliciously on data samples with certain trigger patterns. Several approaches have been proposed to detect such attacks, but they make undesirable assumptions about the attack strategies or require direct access to the trained models, which restricts their utility in practice. This paper addresses these challenges by introducing a Meta Neural Trojan Detection (MNTD) pipeline, which trains a meta-classifier on features queried from a set of benign and Trojaned shadow models to predict whether a target model is Trojaned.
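
The core idea can be sketched in a few lines. The following is a hypothetical toy (scalar "models", a perceptron meta-classifier, made-up query values and trigger region — none of this is the paper's implementation): query shadow models on a fixed input set, use the concatenated outputs as features, and train a meta-classifier on them.

```python
# Toy sketch of meta neural analysis (hypothetical stand-ins, not the paper's
# code): query benign and Trojaned "shadow models" on a fixed input set, use
# the outputs as features, and train a meta-classifier over those features.
import random

random.seed(0)

QUERIES = [0.1, 0.4, 0.7]  # fixed query inputs (placeholder scalars)

def benign_model():
    b = random.gauss(0.0, 0.05)
    return lambda x: x + b  # smooth response everywhere

def trojaned_model():
    b = random.gauss(0.0, 0.05)
    # spikes on inputs in a made-up "trigger region" (here: x > 0.6)
    return lambda x: x + b + (3.0 if x > 0.6 else 0.0)

def features(model):
    # feature representation = the model's outputs on the fixed query set
    return [model(q) for q in QUERIES]

# labeled shadow-model dataset: (features, is_trojaned)
shadow = [(features(benign_model()), 0) for _ in range(50)] + \
         [(features(trojaned_model()), 1) for _ in range(50)]

# meta-classifier: a plain perceptron over shadow features
w, bias = [0.0] * len(QUERIES), 0.0
for _ in range(20):
    for x, y in shadow:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0
        err = y - pred
        w = [wi + 0.1 * err * xi for wi, xi in zip(w, x)]
        bias += 0.1 * err

def detect(model):
    # apply the trained meta-classifier to a fresh target model
    x = features(model)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0

print(detect(trojaned_model()), detect(benign_model()))
```

In the paper, the Trojaned shadow models are sampled from a broad distribution of attack settings ("jumbo learning"); the toy above fixes a single trigger pattern for brevity.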

Odyssey: Creation, Analysis and Detection of Trojan Models

A detector is developed based on intrinsic properties of DNNs that are affected by a Trojan attack; it reveals that Trojan attacks affect the classifier margin and the shape of the decision boundary around the manifold of the clean data.

Practical Detection of Trojan Neural Networks: Data-Limited and Data-Free Cases

A data-limited TrojanNet detector (TND) is proposed that can detect a TrojanNet without accessing any data samples; it is shown that such a TND can be built by leveraging the internal response of hidden neurons, which exhibit Trojan behavior even on random noise inputs.
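
The data-free intuition can be illustrated with a toy simulation (all numbers and the neuron model below are hypothetical, not the authors' method): a neuron coupled to a backdoor trigger tends to fire strongly even on pure random noise, so its mean response over noise inputs stands out as an outlier.

```python
# Hypothetical illustration of the data-free TND intuition: average hidden
# activations over random-noise inputs and flag outlier neurons. The "network"
# below is a toy stand-in, not a real model.
import random
import statistics

random.seed(1)

def hidden_activations(x):
    # toy "hidden layer": 8 benign ReLU neurons plus, at index 8, a neuron
    # simulating Trojan behavior (saturated regardless of the input)
    acts = [max(0.0, random.gauss(0.2 * x, 0.3)) for _ in range(8)]
    acts.append(5.0 + random.gauss(0.0, 0.1))
    return acts

# average each neuron's response over many random-noise inputs
n_noise = 200
means = [0.0] * 9
for _ in range(n_noise):
    for i, a in enumerate(hidden_activations(random.random())):
        means[i] += a / n_noise

# flag neurons whose mean response is a z-score outlier
mu = statistics.mean(means)
sd = statistics.stdev(means)
suspects = [i for i, m in enumerate(means) if (m - mu) / sd > 2.0]
print(suspects)
```

The simulated Trojan neuron (index 8) is the only one flagged; the actual TND additionally handles the data-limited case, which this sketch omits.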

Deep Feature Space Trojan Attack of Neural Networks by Controlled Detoxification

A novel deep feature space Trojan attack is proposed with five characteristics: effectiveness, stealthiness, controllability, robustness, and reliance on deep features; it can evade state-of-the-art defenses.

A Unified Framework for Analyzing and Detecting Malicious Examples of DNN Models

A unified framework for detecting malicious examples and protecting the inference results of Deep Learning models is presented, based on the observation that both adversarial examples and backdoor examples have anomalies during the inference process, highly distinguishable from benign samples.

ML-Doctor: Holistic Risk Assessment of Inference Attacks Against Machine Learning Models

An extensive experimental evaluation over five model architectures and four datasets shows that the complexity of the training dataset plays an important role in attack performance, and that the effectiveness of model stealing and membership inference attacks is negatively correlated.

Scalable Backdoor Detection in Neural Networks

A novel trigger reverse-engineering approach is proposed whose computational complexity does not scale with the number of labels and which is based on a measure that is both interpretable and universal across different network and patch types.

Exposing Backdoors in Robust Machine Learning Models

It is demonstrated that adversarially robust models are susceptible to backdoor attacks and observed that backdoors are reflected in the feature representation of such models, which is leveraged to detect backdoor-infected models via a detection technique called AEGIS.

EX-RAY: Distinguishing Injected Backdoor from Natural Features in Neural Networks by Examining Differential Feature Symmetry

A novel symmetric feature differencing method is proposed that identifies a smallest set of features separating two classes; it outperforms false-positive removal methods based on L2 distance and attribution techniques, and shows promise in detecting a number of semantic backdoor attacks.

Noise-response Analysis for Rapid Detection of Backdoors in Deep Neural Networks

This work proposes a rapid feature-generation step in which DNNs respond to noise-infused images of varying noise intensity. The resulting titration curves are a type of 'fingerprint' for DNNs and can detect a backdoor with high confidence orders of magnitude faster than existing approaches.
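
A toy version of the titration idea (hypothetical scalar "classifiers" and noise model, not the paper's procedure): probe a model with noise of increasing intensity and record how often each noise level lands in a given class; a backdoored model's curve drifts toward the attacker's class as noise grows.

```python
# Hedged sketch of a "titration curve": fraction of noise-infused inputs
# labeled class 1, as a function of noise intensity. Toy models only.
import random

random.seed(2)

def clean_model(x):
    return 1 if x > 0.5 else 0  # simple two-class decision

def backdoored_model(x):
    # toy backdoor that sufficiently large perturbations tend to activate
    return 1 if abs(x - 0.5) > 0.3 else clean_model(x)

def titration_curve(model, levels=(0.0, 0.2, 0.4, 0.6), trials=500):
    curve = []
    for eps in levels:
        hits = 0
        for _ in range(trials):
            x = 0.5 + eps * (random.random() - 0.5) * 2  # noise-infused input
            hits += model(min(1.0, max(0.0, x))) == 1
        curve.append(hits / trials)
    return curve

print(titration_curve(clean_model))
print(titration_curve(backdoored_model))
```

The backdoored model's curve ends noticeably higher than the clean model's at strong noise, which is the kind of fingerprint the titration analysis exploits.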

Data Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses

The goal of this work is to systematically categorize and discuss a wide range of data poisoning and backdoor attacks, approaches to defending against these threats, and an array of open problems in this space.

Trojaning Attack on Neural Networks

A trojaning attack on neural networks is presented that can be successfully triggered without affecting test accuracy for normal input data, and it takes only a small amount of time to attack a complex neural network model.

STRIP: a defence against trojan attacks on deep neural networks

This work builds a STRong Intentional Perturbation (STRIP) based run-time Trojan attack detection system focused on vision systems; it achieves an overall false acceptance rate (FAR) of less than 1%, given a preset false rejection rate (FRR) of 1%, for different types of triggers.
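
STRIP's perturbation idea can be sketched with a toy example (scalar "images", a made-up classifier, and an arbitrary trigger value — all hypothetical): superimpose the suspect input with many clean samples and measure the entropy of the predicted labels. A trigger dominates the blend, so a Trojaned input yields abnormally low entropy.

```python
# Hedged toy of the STRIP intuition: low label entropy under strong
# intentional perturbation flags a trigger-carrying input.
import math
import random

random.seed(3)

def model(x):
    # toy classifier over blended scalar "images": a trigger region hijacks
    # the label to the attacker's target class (7)
    if x > 4.0:              # blend still contains the trigger
        return 7
    return int(x * 3) % 5    # otherwise a content-dependent label

def label_entropy(suspect, clean_pool, n=100):
    # superimpose the suspect input with random clean samples and compute the
    # Shannon entropy of the resulting label distribution
    counts = {}
    for _ in range(n):
        blend = 0.5 * suspect + 0.5 * random.choice(clean_pool)
        y = model(blend)
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())

clean_pool = [random.random() * 2 for _ in range(50)]

print(label_entropy(1.0, clean_pool))  # benign input: labels vary with blend
print(label_entropy(9.0, clean_pool))  # trigger-carrying input: always label 7
```

In practice STRIP thresholds this entropy to trade off FAR against FRR; the toy above only shows the entropy gap itself.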

Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering

This work proposes a novel approach to backdoor detection and removal for neural networks; it is the first methodology capable of detecting poisonous data crafted to insert backdoors and repairing the model without requiring a verified and trusted dataset.
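
The clustering intuition can be illustrated with toy data (1-D "activations" and made-up counts, not the paper's pipeline): within the target class, poisoned samples activate differently from clean ones, so a 2-means split yields a small, well-separated cluster that can be flagged as suspect.

```python
# Hypothetical sketch of activation clustering: 2-means over one class's
# last-hidden-layer activations; the small outlying cluster is suspect.
import random

random.seed(4)

# simulated activations for one class: 90 clean samples near 1.0,
# 10 poisoned samples near 4.0 (toy numbers)
acts = [random.gauss(1.0, 0.2) for _ in range(90)] + \
       [random.gauss(4.0, 0.2) for _ in range(10)]

def two_means(xs, iters=20):
    # minimal 1-D k-means with k=2, initialized at the extremes
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return g0, g1

g0, g1 = two_means(acts)
small, big = sorted((g0, g1), key=len)

# heuristic: a tiny, far-away cluster inside a class suggests poisoned data
ratio = len(small) / len(acts)
separation = abs(sum(small) / len(small) - sum(big) / len(big))
print(ratio, separation)
```

Here the small cluster holds exactly the 10 simulated poisoned samples; the actual method also covers relative cluster size and silhouette-style analyses in the full activation space.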

DeepInspect: A Black-box Trojan Detection and Mitigation Framework for Deep Neural Networks

This work proposes DeepInspect, the first black-box Trojan detection solution with minimal prior knowledge of the model, which learns the probability distribution of potential triggers from the queried model using a conditional generative model and retrieves the footprint of backdoor insertion.

Neural Trojans

This work shows that embedding hidden malicious functionality, i.e., neural Trojans, into a neural IP is an effective attack, and it provides three mitigation techniques: input anomaly detection, re-training, and input preprocessing.

Hardware Trojan Attacks on Neural Networks

A novel framework for inserting malicious hardware Trojans in the implementation of a neural network classifier is developed, and the results show that the proposed algorithm could effectively classify a selected input trigger as a specified class on the MNIST dataset.

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

It is shown that outsourced training introduces new security risks: an adversary can create a maliciously trained network (a backdoored neural network, or a BadNet) that has state-of-the-art performance on the user's training and validation samples, but behaves badly on specific attacker-chosen inputs.

Generative Poisoning Attack Method Against Neural Networks

This work first examines the possibility of applying traditional gradient-based method to generate poisoned data against NNs by leveraging the gradient of the target model w.r.t. the normal data, and proposes a generative method to accelerate the generation rate of the poisoned data.

NIC: Detecting Adversarial Samples with Neural Network Invariant Checking

This paper analyzes the internals of DNN models under various attacks and identifies two common exploitation channels: the provenance channel and the activation value distribution channel, and proposes a novel technique to extract DNN invariants and use them to perform runtime adversarial sample detection.
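
The value-invariant side of this idea can be sketched as follows (toy statistics and thresholds, hypothetical stand-ins for real layer activations): learn the benign range of a layer's activation summary, then flag runtime inputs that violate that invariant.

```python
# Rough sketch of an activation-value invariant check: fit the benign range of
# a layer statistic, flag runtime violations. Toy data, not NIC's extraction.
import random

random.seed(6)

def layer_stat(perturbed=False):
    # toy "activation summary" of one layer for a single input; adversarial
    # perturbation shifts it far outside the benign distribution
    base = random.gauss(1.0, 0.1)
    return base + (2.0 if perturbed else 0.0)

# training phase: learn the benign invariant (mean +/- 4 standard deviations)
samples = [layer_stat() for _ in range(1000)]
mean = sum(samples) / len(samples)
std = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5
lo, hi = mean - 4 * std, mean + 4 * std

def is_adversarial(stat):
    # runtime check: does this input violate the learned invariant?
    return not (lo <= stat <= hi)

print(is_adversarial(layer_stat()))       # benign input
print(is_adversarial(layer_stat(True)))   # perturbed input
```

NIC itself extracts richer invariants (including provenance across layers); this sketch only shows the single-layer value-distribution check.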

Membership Inference Attacks Against Machine Learning Models

This work quantitatively investigates how machine learning models leak information about the individual data records on which they were trained and empirically evaluates the inference techniques on classification models trained by commercial "machine learning as a service" providers such as Google and Amazon.
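
A minimal sketch of why such leakage is measurable (toy confidence distributions, not the paper's shadow-model attack): overfitted models are more confident on training members, so even a simple confidence threshold beats random guessing at membership.

```python
# Hedged toy of membership inference via confidence thresholding. The
# confidence distributions below are simulated, not from a real model.
import random

random.seed(5)

# simulated prediction confidences of an overfitted model
members    = [min(1.0, random.gauss(0.95, 0.03)) for _ in range(500)]
nonmembers = [min(1.0, random.gauss(0.80, 0.08)) for _ in range(500)]

def attack(conf, threshold=0.9):
    # predict "was in the training set" when confidence is high
    return conf >= threshold

correct = sum(attack(c) for c in members) + sum(not attack(c) for c in nonmembers)
accuracy = correct / 1000
print(accuracy)
```

The attack accuracy lands well above the 0.5 random-guess baseline; the paper's shadow-model technique generalizes this by training an attack model instead of hand-picking a threshold.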