Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection

@inproceedings{Ravfogel2020NullIO,
  title={Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection},
  author={Shauli Ravfogel and Yanai Elazar and Hila Gonen and Michael Twiton and Yoav Goldberg},
  booktitle={ACL},
  year={2020}
}
The ability to control for the kinds of information encoded in neural representations has a variety of use cases, especially in light of the challenge of interpreting these models. We present Iterative Null-space Projection (INLP), a novel method for removing information from neural representations. Our method is based on repeated training of linear classifiers that predict a certain property we aim to remove, followed by projection of the representations onto their null-space. By doing so, the…
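The core loop is compact enough to sketch. The following is a minimal, hedged illustration of the INLP idea in Python with numpy and scikit-learn; the function names and the choice of LinearSVC are illustrative assumptions rather than the authors' reference implementation, and composing per-classifier nullspace projections by matrix multiplication is the simple variant of the procedure.

import numpy as np
from sklearn.svm import LinearSVC

def nullspace_projection(w):
    # Projection matrix onto the nullspace of the row vector w.
    w = w / np.linalg.norm(w)
    return np.eye(w.shape[0]) - np.outer(w, w)

def inlp(X, z, n_iters=10):
    # X: (n, d) representations; z: binary protected-attribute labels.
    # Repeatedly fit a linear classifier for z, then project the
    # representations onto the nullspace of its weight vector.
    P = np.eye(X.shape[1])
    X_proj = X.copy()
    for _ in range(n_iters):
        clf = LinearSVC().fit(X_proj, z)
        w = clf.coef_[0]                  # direction used to predict z
        P = nullspace_projection(w) @ P   # compose the projections
        X_proj = X @ P.T                  # re-project the originals
    return P                              # apply as X @ P.T

After enough iterations, a fresh linear classifier should be near chance at predicting z from X @ P.T, which is the notion of guardedness the method targets.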
Contrastive Learning for Fair Representations
TLDR
This paper proposes a method for mitigating bias in classifier training by incorporating contrastive learning, in which instances sharing the same class label are encouraged to have similar representations, while instances sharing a protected attribute are forced further apart.
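As one concrete reading of that objective, here is a hedged PyTorch sketch; the pairwise margin formulation and all names are illustrative assumptions, not the paper's exact loss.

import torch
import torch.nn.functional as F

def fair_contrastive_loss(h, y, z, margin=1.0):
    # h: (n, d) representations; y: class labels; z: protected attribute.
    d = torch.cdist(h, h)                             # pairwise distances
    eye = torch.eye(len(y), dtype=torch.bool)
    same_y = (y[:, None] == y[None, :]) & ~eye        # same class, not self
    same_z = (z[:, None] == z[None, :]) & ~eye        # same protected attribute
    attract = (d[same_y] ** 2).mean()                 # pull same-class pairs together
    repel = (F.relu(margin - d[same_z]) ** 2).mean()  # push same-attribute pairs apart
    return attract + repel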
OSCaR: Orthogonal Subspace Correction and Rectification of Biases in Word Embeddings
TLDR
OSCaR (Orthogonal Subspace Correction and Rectification), a bias-mitigating method that focuses on disentangling biased associations between concepts instead of removing concepts wholesale, is proposed.
Dynamically Disentangling Social Bias from Task-Oriented Representations with Adversarial Attack
TLDR
An adversarial disentangled debiasing model is proposed to dynamically decouple social bias attributes from the intermediate representations trained on the main task, aiming to denoise bias information during downstream training rather than completely removing social bias in pursuit of static unbiased representations.
Efficiently Mitigating Classification Bias via Transfer Learning
TLDR
The Upstream Bias Mitigation for Downstream Fine-Tuning (UBM) framework is proposed, which mitigates one or multiple bias factors in downstream classifiers via transfer learning from an upstream model.
The Geometry of Distributed Representations for Better Alignment, Attenuated Bias, and Improved Interpretability
TLDR
This work addresses some of the problems pertaining to the transparency and interpretability of high-dimensional language representations, including the detection, quantification, and mitigation of socially biased associations.
CausaLM: Causal Model Explanation Through Counterfactual Language Models
TLDR
CausaLM is proposed, a framework for producing causal model explanations using counterfactual language representation models based on fine-tuning of deep contextualized embedding models with auxiliary adversarial tasks derived from the causal graph of the problem.
Evaluating Debiasing Techniques for Intersectional Biases
TLDR
It is argued that a truly fair model must consider ‘gerrymandering’ groups, which are defined not by single attributes but by intersections of attributes, and an extension of the iterative nullspace projection technique that can handle multiple protected attributes is evaluated.
UNQOVERing Stereotypical Biases via Underspecified Questions
TLDR
UNQOVER, a general framework to probe and quantify biases through underspecified questions, is presented, showing that a naive use of model scores can lead to incorrect bias estimates due to two forms of reasoning errors: positional dependence and question independence.
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
TLDR
This paper demonstrates a surprising finding: pretrained language models recognize, to a considerable degree, their own undesirable biases and the toxicity of the content they produce. Building on this, it proposes a decoding algorithm, termed self-debiasing, that reduces the probability of a language model producing problematic text.
Marked Attribute Bias in Natural Language Inference
TLDR
A new observation of gender bias in a downstream NLP application, marked attribute bias in natural language inference, is presented, and a new post-processing debiasing scheme for static word embeddings is proposed.

References

Showing 1–10 of 42 references
Controllable Invariance through Adversarial Feature Learning
TLDR
This paper shows that the proposed framework induces an invariant representation, and leads to better generalization evidenced by the improved performance on three benchmark tasks.
Adversarial Removal of Demographic Attributes Revisited
TLDR
It is shown that a diagnostic classifier trained on the biased baseline neural network also does not generalize to new samples, indicating that it relies on correlations specific to its particular data sample.
Mitigating Unwanted Biases with Adversarial Learning
TLDR
This work presents a framework for mitigating biases concerning demographic groups by including a variable Z for the group of interest and simultaneously learning a predictor and an adversary, which results in accurate predictions that exhibit less evidence of stereotyping Z.
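To make the predictor/adversary setup concrete, here is a minimal PyTorch sketch using a gradient-reversal layer; note that this is a common variant of adversarial debiasing rather than the paper's exact projection-based update, and all module names and sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    # Identity on the forward pass; flips the gradient on the backward
    # pass so the encoder is trained to *hide* the protected attribute.
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None

encoder = nn.Linear(300, 128)    # shared representation (sizes are arbitrary)
predictor = nn.Linear(128, 2)    # main-task head predicting y
adversary = nn.Linear(128, 2)    # tries to recover the protected attribute z

def joint_loss(x, y, z, lam=1.0):
    h = torch.relu(encoder(x))
    task_loss = F.cross_entropy(predictor(h), y)
    adv_loss = F.cross_entropy(adversary(GradReverse.apply(h, lam)), z)
    return task_loss + adv_loss   # minimized jointly by a single optimizer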
Disentangling factors of variation in deep representation using adversarial training
TLDR
A conditional generative model for learning to disentangle the hidden factors of variation within a set of labeled observations, and separate them into complementary codes that are capable of generalizing to unseen classes and intra-class variabilities.
What’s in a Name? Reducing Bias in Bios without Access to Protected Attributes
TLDR
This work proposes a method for discouraging correlation between the predicted probability of an individual’s true occupation and a word embedding of their name, which leverages the societal biases that are encoded in word embeddings, eliminating the need for access to protected attributes.
Privacy-preserving Neural Representations of Text
TLDR
This article measures the privacy of a hidden representation by the ability of an attacker to accurately predict specific private information from it, and characterizes the tradeoff between the privacy and the utility of neural representations.
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
TLDR
This work empirically demonstrates that its algorithms significantly reduce gender bias in embeddings while preserving its useful properties such as the ability to cluster related concepts and to solve analogy tasks.
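The projection at the heart of this method fits in a few lines; below is a hedged numpy sketch of the neutralize step, where using a single he/she difference vector is a simplification of the paper's PCA over several definitional pairs.

import numpy as np

def neutralize(E, v_bias):
    # E: (n, d) word embeddings; v_bias: (d,) bias direction,
    # e.g. emb["he"] - emb["she"] as a rough gender direction.
    v = v_bias / np.linalg.norm(v_bias)
    return E - np.outer(E @ v, v)   # remove each row's component along v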
Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them
Word embeddings are widely used in NLP for a vast range of tasks. It was shown that word embeddings derived from text corpora reflect gender biases in society, causing serious concern. Several recent…
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Understanding Undesirable Word Embedding Associations
TLDR
It is shown that for any embedding model that implicitly does matrix factorization, debiasing vectors post hoc using subspace projection is, under certain conditions, equivalent to training on an unbiased corpus, and that WEAT, the most common association test for word embeddings, systematically overestimates bias.