• Corpus ID: 244714623

Understanding Out-of-distribution: A Perspective of Data Dynamics

Dyah Adila and Dongyeop Kang
Despite the success of machine learning models in Natural Language Processing (NLP) tasks, predictions from these models frequently fail on out-of-distribution (OOD) samples. Prior works have focused on developing state-of-the-art methods for detecting OOD samples. The fundamental question of how OOD samples differ from in-distribution samples remains unanswered. This paper explores how data dynamics in training models can be used to understand the fundamental differences between OOD and in-distribution…


Generalizing to Unseen Domains: A Survey on Domain Generalization

This paper provides a formal definition of domain generalization, discusses several related fields, and categorizes recent algorithms into three classes, which it presents in detail: data manipulation, representation learning, and learning strategy, each of which contains several popular algorithms.



Behavior Analysis of NLI Models: Uncovering the Influence of Three Factors on Robustness

It is shown that evaluations of NLI models can benefit from studying the influence of factors intrinsic to the models or found in the dataset used. Three factors are identified (insensitivity, polarity, and unseen pairs), and their impact on three SNLI models is examined under a variety of conditions.

Understanding Failures in Out-of-Distribution Detection with Deep Generative Models

This work explains why deep generative models have been shown to assign higher probabilities or densities to OOD images than to images from the training distribution, and suggests that estimation error is a more plausible explanation than the misalignment between likelihood-based OOD detection and out-distributions of interest.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

There is substantial room for improvement in NLI systems, and the HANS dataset, which contains many examples where the heuristics fail, can motivate and measure progress in this area.

Analyzing the Behavior of Visual Question Answering Models

Today's VQA models are "myopic" (tend to fail on sufficiently novel instances), often "jump to conclusions" (converge on a predicted answer after 'listening' to just half the question), and are "stubborn" (do not change their answers across images).

Why Normalizing Flows Fail to Detect Out-of-Distribution Data

This work demonstrates that flows learn local pixel correlations and generic image-to-latent-space transformations which are not specific to the target image dataset, and shows that by modifying the architecture of flow coupling layers the authors can bias the flow towards learning the semantic structure of the target data, improving OOD detection.

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

This work presents a model-based tool to characterize and diagnose datasets, and the results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.

Stress Test Evaluation for Natural Language Inference

This work proposes an evaluation methodology consisting of automatically constructed “stress tests” that allow us to examine whether systems have the ability to make real inferential decisions, and reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena.

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

This work introduces the Multi-Genre Natural Language Inference corpus, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding, and shows that it represents a substantially more difficult task than does the Stanford NLI corpus.

Deep Generative Models Strike Back! Improving Understanding and Evaluation in Light of Unmet Expectations for OoD Data

It is shown that dataset pairs such as MNIST fashion/digits and CIFAR10/SVHN are trivially separable and have no overlap on their respective data manifolds that would explain the higher OoD likelihood, and dimensionality reduction using PCA is shown to improve anomaly detection in generative models.