Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation

@article{Tan2018DistillandCompareAB,
  title={Distill-and-Compare: Auditing Black-Box Models Using Transparent Model Distillation},
  author={S. Tan and Rich Caruana and Giles Hooker and Yin Lou},
  journal={Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society},
  year={2018}
}
  • S. Tan, R. Caruana, G. Hooker, Y. Lou
  • Published 17 October 2017
  • Computer Science
  • Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society
Black-box risk scoring models permeate our lives, yet are typically proprietary or opaque. We propose Distill-and-Compare, an approach to audit such models without probing the black-box model API or pre-defining features to audit. To gain insight into black-box models, we treat them as teachers, training transparent student models to mimic the risk scores assigned by the black-box models. We compare the mimic model trained with distillation to a second, un-distilled transparent model trained on… 
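The truncated abstract describes the core recipe: distill the black-box risk scores into a transparent student (the mimic model), train a second transparent model of the same class on ground-truth outcomes, and compare the two. A minimal sketch of that recipe follows, using shallow scikit-learn trees as stand-ins for the transparent model class (the paper itself uses interpretable generalized additive models); X, risk_scores, and y are hypothetical names for the audit features, the black-box scores, and the true outcomes.

# Minimal Distill-and-Compare sketch, not the authors' implementation.
# Assumes risk_scores lie in [0, 1] so they are comparable to predicted
# outcome probabilities.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def distill_and_compare(X, risk_scores, y, max_depth=4):
    # 1. Mimic model: a transparent student distilled from the black-box scores.
    mimic = DecisionTreeRegressor(max_depth=max_depth).fit(X, risk_scores)

    # 2. Outcome model: the same transparent class trained on true outcomes.
    outcome = DecisionTreeClassifier(max_depth=max_depth).fit(X, y)

    # 3. Compare: large per-row gaps flag regions of the input space where the
    #    black box departs from what the ground-truth outcomes support.
    gap = mimic.predict(X) - outcome.predict_proba(X)[:, 1]
    return mimic, outcome, gap

Inspecting how the two transparent models weight individual features (feature importances here, shape functions with additive models) is where the audit happens; the gap vector only points at where to look.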

Citations

The Road to Explainability is Paved with Bias: Measuring the Fairness of Explanations
TLDR
This work audits the quality of explainability methods for different protected subgroups using real data from four settings in finance, healthcare, college admissions, and the US justice system, and finds that the approximation quality of explanation models differs significantly between subgroups.
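The fidelity gap summarized above can be checked directly by fitting a transparent surrogate to the black box's outputs and comparing goodness of fit across protected subgroups. A simplified sketch with a single global linear surrogate (the paper studies local explanation models); blackbox, X, and group are hypothetical names for a fitted black-box model, the features, and subgroup membership.

# Per-subgroup fidelity of a surrogate explanation: R^2 of a linear surrogate
# against the black-box scores, computed separately for each subgroup. Large
# gaps mean the explanation is less faithful for some groups. A simplified
# sketch, not the paper's experimental setup.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def subgroup_fidelity(blackbox, X, group):
    scores = blackbox.predict(X)                   # black-box outputs to explain
    surrogate = LinearRegression().fit(X, scores)  # transparent approximation
    approx = surrogate.predict(X)
    return {g: r2_score(scores[group == g], approx[group == g])
            for g in np.unique(group)}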
Distilling Black-Box Travel Mode Choice Model for Behavioral Interpretation
TLDR
This paper proposes to apply and extend the model distillation approach, a model-agnostic machine-learning interpretation method, to explain how a black-box travel mode choice model makes predictions for the entire population and subpopulations of interest.
How can we fool LIME and SHAP? Adversarial Attacks on Post hoc Explanation Methods
TLDR
It is demonstrated how extremely biased (racist) classifiers crafted by the proposed framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations which do not reflect the underlying biases.
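The attack summarized above exploits the fact that perturbation-based explainers such as LIME and SHAP query the model on points that tend to fall off the data manifold. A toy sketch of that scaffolding idea, not the authors' code: an IsolationForest stands in for the off-manifold detector, and biased_model / innocuous_model are hypothetical classifiers.

# Behave innocuously on points that look like explainer perturbations, and
# apply the biased rule only on realistic inputs, so post hoc explanations
# mostly see the innocuous behaviour. Toy sketch only.
from sklearn.ensemble import IsolationForest

class ScaffoldedClassifier:
    def __init__(self, biased_model, innocuous_model, real_data):
        self.biased = biased_model
        self.innocuous = innocuous_model
        # Flags points unlike the training distribution, where explainer
        # perturbations tend to land.
        self.detector = IsolationForest(random_state=0).fit(real_data)

    def predict(self, X):
        on_manifold = self.detector.predict(X) == 1   # 1 = inlier, -1 = outlier
        preds = self.innocuous.predict(X).copy()
        if on_manifold.any():
            preds[on_manifold] = self.biased.predict(X[on_manifold])
        return preds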
Patch Shortcuts: Interpretable Proxy Models Efficiently Find Black-Box Vulnerabilities
TLDR
This work presents an approach to detect learned shortcuts using an interpretable-by-design network as a proxy to the black-box model of interest, and demonstrates on the autonomous driving dataset A2D2 that extracted patch shortcuts significantly influence the black-box model.
Auditing Data Provenance in Text-Generation Models
TLDR
A new model auditing technique is developed that helps users check if their data was used to train a machine learning model, and it is empirically shown that the method can successfully audit well-generalized models that are not overfitted to the training data.
Manipulating and Measuring Model Interpretability
TLDR
A sequence of pre-registered experiments showed participants functionally identical models that varied only in two factors commonly thought to make machine learning models more or less interpretable: the number of features and the transparency of the model (i.e., whether the model internals are clear or black box).
Model Distillation for Faithful Explanations of Medical Code Predictions
TLDR
This work proposes to use knowledge distillation, or training a student model that mimics the behavior of a trained teacher model, as a technique to generate faithful and plausible explanations in explainable AI.
Why should you trust my interpretation? Understanding uncertainty in LIME predictions
TLDR
This work demonstrates two sources of uncertainty in the method "Local Interpretable Model-agnostic Explanations" (LIME): the randomness in its sampling procedure and the variation of interpretation quality across different input data points.
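One of the two uncertainty sources named above, the sampling randomness, is easy to observe: re-running LIME on the same instance gives different feature weights. A small sketch, assuming the lime package and hypothetical names clf (a fitted classifier with predict_proba), X_train, feature_names, and an instance x.

# Re-run LIME several times on one instance and report the mean and standard
# deviation of each feature's weight; the spread comes purely from LIME's
# random perturbation sampling. Rough string matching of rule text to feature
# names is good enough for a sketch.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

def lime_weight_spread(clf, X_train, feature_names, x, runs=20):
    explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
                                     mode="classification")
    weights = {name: [] for name in feature_names}
    for _ in range(runs):
        exp = explainer.explain_instance(x, clf.predict_proba,
                                         num_features=len(feature_names))
        for rule, w in exp.as_list():            # rules look like "age <= 25.0"
            for name in feature_names:
                if name in rule:
                    weights[name].append(w)
    return {name: (np.mean(ws), np.std(ws)) for name, ws in weights.items() if ws}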
...

References

Showing 1-10 of 58 references
Auditing black-box models for indirect influence
TLDR
This paper presents a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the data set, without knowing how the models work.
Transparent Model Distillation
TLDR
This work investigates model distillation for transparency, asking whether fully-connected neural networks can be distilled into models that are transparent or interpretable in some sense, and tries two types of student models.
Multiaccuracy: Black-Box Post-Processing for Fairness in Classification
TLDR
It is proved that MULTIACCURACY-BOOST converges efficiently, and it is shown that if the initial model is accurate on an identifiable subgroup, then the post-processed model will remain so.
Fairer and more accurate, but for whom?
TLDR
A model comparison framework is introduced for automatically identifying subgroups in which the differences between models are most pronounced, with a primary focus on subgroups where the models differ in terms of fairness-related quantities such as racial or gender disparities.
Counterfactual Explanations Without Opening the Black Box: Automated Decisions and the GDPR
TLDR
It is suggested that data controllers should offer a particular type of explanation, unconditional counterfactual explanations, which describe the smallest change to the world that can be made to obtain a desirable outcome, or to arrive at the closest possible world, without needing to explain the internal logic of the system.
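The counterfactuals proposed above ask for the smallest change to an input that flips the model's decision, found without opening the black box. The paper casts this as a distance-minimizing optimization; the sketch below substitutes a crude random search and assumes a model with a predict method and a 1-D numpy instance x (hypothetical names).

# Sample small perturbations of x, keep those that change the predicted class,
# and return the closest one. A stand-in for the optimization in the paper.
import numpy as np

def random_counterfactual(model, x, n_samples=5000, scale=0.5, seed=0):
    rng = np.random.default_rng(seed)
    original = model.predict(x.reshape(1, -1))[0]
    candidates = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    flipped = candidates[model.predict(candidates) != original]
    if len(flipped) == 0:
        return None                                # no counterfactual found
    distances = np.linalg.norm(flipped - x, axis=1)
    return flipped[np.argmin(distances)]           # closest decision-flipping point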
Iterative Orthogonal Feature Projection for Diagnosing Bias in Black-Box Models
TLDR
An iterative procedure is presented, based on orthogonal projection of input attributes, for enabling interpretability of black-box predictive models; it can quantify the relative dependence of a black-box model on its input attributes.
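A heavily simplified, single-pass take on that projection idea is sketched below: remove the linear component of every other feature that is explained by a chosen attribute, neutralize the attribute itself, and measure how much the (unchanged) black-box model's accuracy drops. The paper's procedure is iterative and handles more than linear dependence; model, X, y, and attr_idx are hypothetical names.

# Obscure one attribute (and its linear traces in the other features), then
# compare black-box accuracy before and after; a large drop suggests the model
# depends on that attribute directly or indirectly. Simplified sketch only.
import numpy as np

def influence_via_projection(model, X, y, attr_idx):
    a = X[:, attr_idx].astype(float)
    a_centered = a - a.mean()
    X_obscured = X.astype(float)
    X_obscured[:, attr_idx] = a.mean()              # flatten the attribute itself
    for j in range(X.shape[1]):
        if j == attr_idx:
            continue
        col = X[:, j].astype(float)
        beta = np.dot(a_centered, col) / np.dot(a_centered, a_centered)
        X_obscured[:, j] = col - beta * a_centered  # project out the attribute
    before = (model.predict(X) == y).mean()
    after = (model.predict(X_obscured) == y).mean()
    return before - after                           # proxy for relative dependence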
Certifying and Removing Disparate Impact
TLDR
This work links disparate impact to a measure of classification accuracy that, while known, has received relatively little attention, and proposes a test for disparate impact based on how well the protected class can be predicted from the other attributes.
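The test summarized above turns on how predictable the protected class is from the remaining attributes: if no classifier can recover it, the data cannot carry much disparate impact. A sketch of that check alongside the usual 80%-rule ratio; X (non-protected features), protected (binary group membership), and decisions (the model's binary decisions) are hypothetical arrays, and the paper's certification procedure differs in its details.

# (1) 80%-rule ratio of positive-decision rates between groups, and
# (2) cross-validated balanced accuracy of predicting the protected class from
#     the other attributes, the quantity the certification test builds on.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def disparate_impact_ratio(decisions, protected):
    rate_a = decisions[protected == 0].mean()
    rate_b = decisions[protected == 1].mean()
    return min(rate_a, rate_b) / max(rate_a, rate_b)   # < 0.8 suggests impact

def protected_predictability(X, protected):
    clf = LogisticRegression(max_iter=1000)
    # Near 0.5: the features reveal little about the protected class.
    # Well above 0.5: disparate impact via proxy features is possible.
    return cross_val_score(clf, X, protected, cv=5,
                           scoring="balanced_accuracy").mean()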
Interpretable classification models for recidivism prediction
TLDR
A recent method called supersparse linear integer models is used to produce accurate, transparent and interpretable scoring systems along the full ROC curve, which can support decision making across many different use cases.
Discriminatory Power - An Obsolete Validation Criterion?
In this paper we analyse two common measures of discriminatory power - the Accuracy Ratio and the Area Under the Receiver Operating Characteristic - in a probabilistic framework. Under the assumption…
Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness
TLDR
It is proved that the computational problem of auditing subgroup fairness for both equality of false positive rates and statistical parity is equivalent to the problem of weak agnostic learning, which means it is computationally hard in the worst case, even for simple structured subclasses.
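The auditing problem described above, finding a structured subgroup on which error rates diverge, can be approached heuristically by handing the search to a simple learner. A sketch that looks for false-positive-rate disparities with a shallow decision tree; X_attrs, y_true, y_pred, and feature_names are hypothetical inputs, and this heuristic is only a stand-in for the weak-agnostic-learning reduction in the paper.

# Among the true negatives, train a shallow tree to predict where the model
# produces false positives; printed rules whose class-1 weight is large
# relative to class 0 describe candidate unfair subgroups.
from sklearn.tree import DecisionTreeClassifier, export_text

def audit_false_positive_subgroups(X_attrs, y_true, y_pred, feature_names):
    negatives = (y_true == 0)
    fp = (y_pred[negatives] == 1).astype(int)       # false positives among negatives
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50)
    tree.fit(X_attrs[negatives], fp)
    print("overall false-positive rate:", round(fp.mean(), 3))
    print(export_text(tree, feature_names=list(feature_names), show_weights=True))
    return tree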
...