On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

Stephen Mussmann, Robin Jia, Percy Liang
Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., 99.99% of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: state-of-the-art models trained on QQP and WikiQA each have only 2.4% average precision when evaluated on realistically imbalanced test data. We…

Deep Indexed Active Learning for Matching Heterogeneous Entity Representations

This work proposes DIAL, a scalable active learning approach that jointly learns embeddings to maximize recall for blocking and accuracy for matching blocked pairs, together with an Index-By-Committee framework in which each committee member learns representations based on powerful pre-trained transformer language models.

Question Answering Infused Pre-training of General-Purpose Contextualized Representations

A bi-encoder QA model is trained, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoding model on 80 million synthesized QA pairs, and shows large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection on four datasets.

ALLSH: Active Learning Guided by Local Sensitivity and Hardness

This work proposes to retrieve unlabeled samples with a local sensitivity and hardness-aware acquisition function that generates data copies through local perturbations and selects data points whose predictive likelihoods diverge the most from their copies.

Active Learning Helps Pretrained Models Learn the Intended Task

This work investigates whether pretrained models are better active learners, capable of disambiguating between the possible tasks a user may be trying to specify, and finds that better active learning is an emergent property of the pretraining process.

Multilingual Detection of Personal Employment Status on Twitter

Detecting disclosures of individuals’ employment status on social media can provide valuable information to match job seekers with suitable vacancies, offer social protection, or measure labor market

Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review

This work proposes the first intertextual model of text-based collaboration, which encompasses three major phenomena that make up a full iteration of the review–revise–and-resubmit cycle: pragmatic tagging, linking and long-document version alignment.

Learning Adaptive Language Interfaces through Decomposition

A neural semantic parsing system that learns new high-level abstractions through decomposition is introduced, demonstrating the flexibility of modern neural systems, as well as the one-shot reliable generalization of grammar-based methods.



Passage Re-ranking with BERT

A simple re-implementation of BERT for query-based passage re-ranking on the TREC-CAR dataset and the top entry in the leaderboard of the MS MARCO passage retrieval task, outperforming the previous state of the art by 27% in MRR@10.

Undersampling Approach for Imbalanced Training Sets and Induction from Multi-label Text-Categorization Domains

The study reported in this paper shows how taking this aspect into consideration with a majority-class undersampling technique can indeed improve classification performance as measured by criteria common in text categorization: macro/micro precision, recall, and F1.

Selection Bias Explorations and Debias Methods for Natural Language Sentence Matching Datasets

This paper investigates the problem of selection bias on six NLSM datasets, finds that four of them are significantly biased, and proposes a training and evaluation framework to alleviate the bias.

TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection

The approach establishes the state of the art on two well-known benchmarks, WikiQA and TREC-QA, achieving impressive MAP scores, and confirms the positive impact of TandA in an industrial setting, using domain-specific datasets subject to different types of noise.

Position-aware Attention and Supervised Data Improve Slot Filling

An effective new model is proposed that combines an LSTM sequence model with a form of entity position-aware attention better suited to relation extraction, and TACRED is built, a large supervised relation extraction dataset obtained via crowdsourcing and targeted towards TAC KBP relations.

Reading Wikipedia to Answer Open-Domain Questions

This approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs, indicating that both modules are highly competitive with respect to existing counterparts.

A Compare-Aggregate Model with Latent Clustering for Answer Selection

A novel method is proposed for sentence-level answer selection, a fundamental problem in natural language processing, that adopts a pretrained language model and uses a novel latent clustering method to compute additional information within the target corpus.

WikiQA: A Challenge Dataset for Open-Domain Question Answering

The WIKIQA dataset is described, a new publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering, which is more than an order of magnitude larger than the previous dataset.

CQADupStack: Gold or Silver?

The quality of a recently-released dataset for community question-answering (cQA) research, CQADupStack, is analysed and it is suggested that the number of duplicates can be increased by around 45%, by annotating only 0.0003% of all the question pairs in the data set.

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

This work presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT, and uses a self-supervised loss that focuses on modeling inter-sentence coherence.