Evaluating Gender Bias in Machine Translation

Gabriel Stanovsky, Noah A. Smith, Luke Zettlemoyer
We present the first challenge set and evaluation protocol for the analysis of gender bias in machine translation (MT). We devise an automatic gender bias evaluation method for eight target languages with grammatical gender, based on morphological analysis (e.g., the use of female inflection for the word "doctor"). Our analyses show that four popular industrial MT systems and two recent state-of-the-art academic MT models are significantly prone to gender-biased translation errors for all tested target languages.
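The evaluation idea described above can be sketched in a few lines: recover the grammatical gender the MT system assigned to an entity via morphological analysis, then score it against the gold gender implied by the source. The snippet below is a toy illustration, not the authors' code; the hand-coded Spanish lexicon stands in for a real morphological analyzer.

```python
# Toy morphological lexicon for Spanish profession nouns (assumption: the
# real protocol uses an automatic morphological analyzer, not a lookup table).
ES_GENDER = {
    "doctor": "male", "doctora": "female",
    "enfermero": "male", "enfermera": "female",
}

def translated_gender(translation: str) -> str:
    """Return the morphological gender of the first recognized entity word."""
    for token in translation.lower().split():
        word = token.strip(".,;!?")
        if word in ES_GENDER:
            return ES_GENDER[word]
    return "unknown"

def gender_accuracy(examples) -> float:
    """examples: list of (gold_gender, mt_translation) pairs."""
    correct = sum(1 for gold, hyp in examples
                  if translated_gender(hyp) == gold)
    return correct / len(examples)

# A biased system may choose a masculine "doctor" even when the source
# sentence ("The doctor asked for help; she was busy") implies feminine:
examples = [
    ("female", "La doctora pidió ayuda."),  # correct feminine inflection
    ("female", "El doctor pidió ayuda."),   # biased: masculine inflection
]
print(gender_accuracy(examples))  # 0.5
```

Aggregating this per-entity accuracy over stereotypical versus anti-stereotypical sentences is what exposes systematic bias, rather than any single translation error.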


Extending Challenge Sets to Uncover Gender Bias in Machine Translation: Impact of Stereotypical Verbs and Adjectives

WiBeMT is presented, which extends the first challenge set with gender-biased adjectives and sentences containing gender-biased verbs, is explicitly designed to measure the extent of gender bias in MT systems, and shows gender bias in all three MT systems tested.

Evaluating Gender Bias in Hindi-English Machine Translation

This work attempts to evaluate and quantify the gender bias within a Hindi-English machine translation system by implementing a modified version of the existing TGBI metric based on the grammatical considerations for Hindi.

Improving Gender Translation Accuracy with Filtered Self-Training

A gender-filtered self-training technique is presented to improve gender translation accuracy on unambiguously gendered inputs: a source monolingual corpus and an initial model are used to generate gender-specific pseudo-parallel corpora, which are then added to the training data.

The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses

A new corpus for gender identification and rewriting in contexts involving one or two target users (I and/or You) – first and second grammatical persons with independent grammatical gender preferences – in Arabic, a gender-marking, morphologically rich language.

Gender Coreference and Bias Evaluation at WMT 2020

This work presents the largest evidence to date of gender coreference bias in machine translation, covering more than 19 systems submitted to WMT 2020 across four diverse target languages: Czech, German, Polish, and Russian.

GFST: Gender-Filtered Self-Training for More Accurate Gender in Translation

This work proposes gender-filtered self-training (GFST) to improve gender translation accuracy on unambiguously gendered inputs and evaluates GFST on translation from English into five languages, finding that it improves gender accuracy without damaging generic quality.

Collecting a Large-Scale Gender Bias Dataset for Coreference Resolution and Machine Translation

Grammatical patterns indicating stereotypical and non-stereotypical gender-role assignments are found in corpora from three domains, resulting in a first large-scale gender bias dataset of 108K diverse real-world English sentences; the dataset lends itself to finetuning a coreference resolution model, which is found to mitigate bias on a held-out set.

Neural Machine Translation Doesn’t Translate Gender Coreference Right Unless You Make It

This paper proposes schemes for incorporating explicit word-level gender inflection tags into NMT, and proposes an extension to assess translations of gender-neutral entities from English given a corresponding linguistic convention, such as a non-binary inflection, in the target language.

Towards Cross-Lingual Generalization of Translation Gender Bias

This study applies this philosophy to the problem of translation gender bias, creating a template that makes evaluation less tilted toward specific types of language pairs, and finds that translation fluency and inference accuracy are not necessarily correlated.

Mitigating Gender Bias in Machine Translation with Target Gender Annotations

It is argued that the information necessary for an adequate translation cannot always be deduced from the sentence being translated, or may even depend on external knowledge; a method is therefore presented for training machine translation systems to use word-level annotations containing information about the subject's gender.

Equalizing Gender Bias in Neural Machine Translation with Word Embeddings Techniques

This work proposes, experiments with, and analyzes the integration of two debiasing techniques over GloVe embeddings in the Transformer translation architecture, and shows that the proposed system learns to equalize existing biases from the baseline system.

Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods

A data-augmentation approach is demonstrated that, in combination with existing word-embedding debiasing techniques, removes the bias demonstrated by rule-based, feature-rich, and neural coreference systems in WinoBias without significantly affecting their performance on existing datasets.

Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns

GAP, a gender-balanced labeled corpus of 8,908 ambiguous pronoun–name pairs, is presented and released to provide diverse coverage of the challenges posed by real-world text; syntactic structure and continuous neural models are shown to provide promising, complementary cues for approaching the challenge.

Evaluating Discourse Phenomena in Neural Machine Translation

This article presents hand-crafted discourse test sets designed to test the recently proposed multi-encoder NMT models' ability to exploit previous source and target sentences, and explores a novel way of exploiting context from the previous sentence.

Understanding Back-Translation at Scale

This work broadens the understanding of back-translation and investigates a number of methods to generate synthetic source sentences, finding that in all but resource-poor settings, back-translations obtained via sampling or noised beam outputs are most effective.

Gender Bias in Coreference Resolution

A novel, Winograd schema-style set of minimal pair sentences that differ only by pronoun gender are introduced, and systematic gender bias in three publicly-available coreference resolution systems is evaluated and confirmed.

Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them

Word embeddings are widely used in NLP for a vast range of tasks. It has been shown that word embeddings derived from text corpora reflect gender biases in society, causing serious concern; this work examines whether recent debiasing methods actually remove those biases or merely conceal them.

A Challenge Set Approach to Evaluating Machine Translation

This work presents an English-French challenge set approach to translation evaluation and error analysis, and uses it to analyze phrase-based and neural systems, providing a more fine-grained picture of the strengths of neural systems.

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

This work proposes to inject corpus-level constraints for calibrating existing structured prediction models and design an algorithm based on Lagrangian relaxation for collective inference to reduce the magnitude of bias amplification in multilabel object classification and visual semantic role labeling.

Bleu: a Method for Automatic Evaluation of Machine Translation

This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.