Selection Bias Explorations and Debias Methods for Natural Language Sentence Matching Datasets

@inproceedings{Zhang2019SelectionBE,
  title={Selection Bias Explorations and Debias Methods for Natural Language Sentence Matching Datasets},
  author={Guanhua Zhang and Bing Bai and Jian Liang and Kun Bai and Shiyu Chang and Mo Yu and Conghui Zhu and Tiejun Zhao},
  booktitle={ACL},
  year={2019}
}
Natural Language Sentence Matching (NLSM) has gained substantial attention from both academia and industry, and rich public datasets have contributed a lot to this progress. However, biased datasets can also hurt the generalization performance of trained models and yield untrustworthy evaluation results. For many NLSM datasets, the providers select some pairs of sentences into the datasets, and this sampling procedure can easily introduce unintended patterns, i.e., selection bias. One example is the…
Unlearn Dataset Bias in Natural Language Inference by Fitting the Residual
TLDR
This work formalizes the concept of dataset bias under the framework of distribution shift and presents DRiFt, a simple debiasing algorithm based on residual fitting, for designing learning algorithms that guard against known dataset bias.
Mitigating Annotation Artifacts in Natural Language Inference Datasets to Improve Cross-dataset Generalization Ability
TLDR
Experimental results demonstrate that the methods considered can alleviate the negative effect of the artifacts and improve the generalization ability of models.
An Empirical Study on Model-agnostic Debiasing Strategies for Robust Natural Language Inference
TLDR
This paper benchmarks prevailing neural NLI models, including pretrained ones, on various adversarial datasets and tries to combat distinct known biases by modifying a mixture-of-experts (MoE) ensemble method; it shows that it is nontrivial to mitigate multiple NLI biases at the same time, and that a model-level ensemble method outperforms the MoE ensemble method.
On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks
TLDR
This work collects training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs, greatly improving average precision.
Beyond Leaderboards: A survey of methods for revealing weaknesses in Natural Language Inference data and models
TLDR
This structured survey provides an overview of the evolving research area by categorising reported weaknesses in models and datasets and the methods proposed to reveal and alleviate those weaknesses for the English language.
MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization
TLDR
This work proposes MATINF, the first jointly labeled large-scale dataset for classification, question answering and summarization, and benchmarks existing methods and a novel multi-task baseline over MATINF to inspire further research.
Why Is Attention Not So Interpretable?
Attention-based methods have played an important role in model interpretations, where the calculated attention weights are expected to highlight the critical parts of inputs (e.g., keywords in…
Why is Attention Not So Interpretable
TLDR
This work theoretically analyzes combinatorial shortcuts, designs an intuitive experiment to demonstrate their existence, and proposes two methods to mitigate the issue; experiments show that the proposed methods can effectively improve the interpretability of attention mechanisms on a variety of datasets.
Pointwise Paraphrase Appraisal is Potentially Problematic
TLDR
Although the standard way of fine-tuning BERT for paraphrase identification by pairing two sentences as one sequence results in a model with state-of-the-art performance, that model may perform poorly on simple tasks like identifying pairs with two identical sentences.
Why is Attention Not So Attentive?
TLDR
It is revealed that one root cause of this phenomenon can be ascribed to combinatorial shortcuts, namely that the models may obtain information not only from the parts highlighted by attention mechanisms but also from the attention weights themselves.

References

Showing 1–10 of 48 references
Measuring and Mitigating Unintended Bias in Text Classification
TLDR
A new approach to measuring and mitigating unintended bias in machine learning models is introduced, using a set of common demographic identity terms as the subset of input features on which to measure bias.
Annotation Artifacts in Natural Language Inference Data
TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.
Sentence Pair Scoring: Towards Unified Framework for Text Comprehension
TLDR
A unified open-source software framework with easily pluggable models and tasks, which enables experiments with multi-task reusability of trained sentence models and sets a new state of the art in performance on the Ubuntu Dialogue dataset.
Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms
TLDR
This paper conducts a point-by-point comparative study between Simple Word-Embedding-based Models (SWEMs), consisting of parameter-free pooling operations, relative to word-embedding-based RNN/CNN models, and proposes two additional pooling strategies over learned word embeddings: a max-pooling operation for improved interpretability and a hierarchical pooling operation, which preserves spatial information within text sequences.
A SICK cure for the evaluation of compositional distributional semantic models
TLDR
This work aims to help the research community working on compositional distributional semantic models (CDSMs) by providing SICK (Sentences Involving Compositional Knowledge), a large English benchmark tailored for them.
A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference
TLDR
The Multi-Genre Natural Language Inference corpus is introduced, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding, and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.
Unsupervised Construction of Large Paraphrase Corpora: Exploiting Massively Parallel News Sources
TLDR
Investigation of unsupervised techniques for acquiring monolingual sentence-level paraphrases from a corpus of temporally and topically clustered news articles collected from thousands of web-based news sources shows that edit-distance data is cleaner and more easily aligned than the heuristic data.
Selection bias in the LETOR datasets
The LETOR datasets consist of data extracted from traditional IR test corpora. For each of a number of test topics, a set of documents has been extracted, in the form of features of each…
Hypothesis Only Baselines in Natural Language Inference
TLDR
This approach, which is referred to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets and suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.
A large annotated corpus for learning natural language inference
TLDR
The Stanford Natural Language Inference corpus is introduced, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning, which allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.