Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection

@inproceedings{ravfogel2020null,
  title={Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection},
  author={Shauli Ravfogel and Yanai Elazar and Hila Gonen and Michael Twiton and Yoav Goldberg},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2020},
}
The ability to control the kinds of information encoded in neural representations has a variety of use cases, especially in light of the challenge of interpreting these models. We present Iterative Null-space Projection (INLP), a novel method for removing information from neural representations. Our method is based on repeated training of linear classifiers that predict a certain property we aim to remove, followed by projection of the representations onto their null-space. By doing so, the…
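The abstract's iterate-and-project loop can be sketched in a few lines of numpy and scikit-learn. This is not the authors' reference implementation, and the paper composes the per-iteration projections more carefully (projecting onto the intersection of the classifiers' nullspaces); the sketch below, with hypothetical names `nullspace_projection` and `inlp`, shows only the core step of each iteration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def nullspace_projection(W):
    """Projection matrix onto the nullspace of W (rows = classifier weights)."""
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    row_basis = Vt[s > 1e-10]                   # orthonormal basis of W's row space
    return np.eye(W.shape[1]) - row_basis.T @ row_basis

def inlp(X, z, n_iters=10):
    """Repeatedly fit a linear probe for the protected attribute z,
    then project the representations X onto that probe's nullspace."""
    P = np.eye(X.shape[1])
    X_proj = X.copy()
    for _ in range(n_iters):
        clf = LogisticRegression(max_iter=1000).fit(X_proj, z)
        P = nullspace_projection(clf.coef_) @ P  # compose with previous projections
        X_proj = X @ P.T                         # guard the original representations
    return X_proj, P
```

After enough iterations, no linear classifier can recover `z` from `X_proj` better than chance, since every direction such a classifier could exploit has been projected away.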

Learning Disentangled Textual Representations via Statistical Measures of Similarity

This work introduces a family of regularizers for learning disentangled representations that do not require additional training, are faster, and do not involve additional tuning, while achieving better results when combined with both pretrained and randomly initialized text encoders.

Adversarial Concept Erasure in Kernel Space

This work proposes a kernelization of the linear concept-removal objective of Ravfogel et al.

Linear Guardedness and its Implications

It is shown that, in the binary case, the neutralized concept cannot be recovered by an additional linear layer, but it is pointed out that—contrary to what was implicitly argued in previous works—multiclass softmax classifiers can be constructed that indirectly recover the concept.

Linear Adversarial Concept Erasure

This paper formulates the problem of identifying and erasing a linear subspace that corresponds to a given concept in order to prevent linear predictors from recovering the concept, and recovers a low-dimensional subspace whose removal mitigates bias by intrinsic and extrinsic evaluation.


It is proved that post-hoc or adversarial methods to remove unwanted attributes from a model’s representation will fail to remove the attribute correctly, and a spuriousness metric is proposed to gauge the quality of the probing classifier.

Entropy-based Attention Regularization Frees Unintended Bias Mitigation from Lists

The resulting model matches or exceeds state-of-the-art performance for hate speech classification and bias metrics on three benchmark corpora in English and Italian and reveals overfitting terms, i.e., terms most likely to induce bias, to help identify their effect on the model, task, and predictions.

Conditional Supervised Contrastive Learning for Fair Text Classification

This work theoretically analyzes the connections between learning representations under a fairness constraint and conditional supervised contrastive objectives, and proposes using such objectives to learn fair representations for text classification.

Contrastive Learning for Fair Representations

This work proposes a method for mitigating bias in classifier training by incorporating contrastive learning, in which instances sharing the same class label are encouraged to have similar representations, while instances sharing a protected attribute are pushed further apart.

OSCaR: Orthogonal Subspace Correction and Rectification of Biases in Word Embeddings

OSCaR (Orthogonal Subspace Correction and Rectification), a bias-mitigating method that focuses on disentangling biased associations between concepts instead of removing concepts wholesale, is proposed.

Learning Fair Representations via Rate-Distortion Maximization

A novel debiasing technique, Fairness-aware Rate Maximization (FaRM), is proposed that removes protected information by making representations of instances belonging to the same protected-attribute class uncorrelated, using the rate-distortion function.



Controllable Invariance through Adversarial Feature Learning

This paper shows that the proposed framework induces an invariant representation, and leads to better generalization evidenced by the improved performance on three benchmark tasks.

Adversarial Removal of Demographic Attributes Revisited

It is shown that a diagnostic classifier trained on the biased baseline neural network also does not generalize to new samples, indicating that it relies on correlations specific to its particular data sample.

Disentangling factors of variation in deep representation using adversarial training

A conditional generative model is proposed for learning to disentangle the hidden factors of variation within a set of labeled observations and to separate them into complementary codes that generalize to unseen classes and intra-class variabilities.

What’s in a Name? Reducing Bias in Bios without Access to Protected Attributes

This work proposes a method for discouraging correlation between the predicted probability of an individual’s true occupation and a word embedding of their name, which leverages the societal biases that are encoded in word embeddings, eliminating the need for access to protected attributes.

Privacy-preserving Neural Representations of Text

This article measures the privacy of a hidden representation by the ability of an attacker to accurately predict specific private information from it, and characterizes the tradeoff between the privacy and the utility of neural representations.

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings

This work empirically demonstrates that its algorithms significantly reduce gender bias in embeddings while preserving their useful properties, such as the ability to cluster related concepts and to solve analogy tasks.

Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them

Word embeddings are widely used in NLP for a vast range of tasks. It has been shown that word embeddings derived from text corpora reflect gender biases in society, causing serious concern.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

Cleaning the Null Space: A Privacy Mechanism for Predictors

Two algorithms are described that aim to provide privacy when the predictor's first stage is a linear operator; this can sometimes be achieved with almost no effect on the ability to predict the desired information.

Equality of Opportunity in Supervised Learning

This work proposes a criterion for discrimination against a specified sensitive attribute in supervised learning, where the goal is to predict some target based on available features and shows how to optimally adjust any learned predictor so as to remove discrimination according to this definition.