Posterior Calibrated Training on Sentence Classification Tasks

Taehee Jung, Dongyeop Kang, Hua Cheng, L. Mentch and T. Schaaf
Most classification models work by first predicting a posterior probability distribution over all classes and then selecting the class with the largest estimated probability. In many settings, however, the quality of the posterior probability itself (e.g., a 65% chance of having diabetes) gives more reliable information than the final predicted class alone. When these models are shown to be poorly calibrated, most fixes to date have relied on posterior calibration, which rescales the predicted…
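The distinction the abstract draws, between the full posterior distribution and the single argmax class, can be sketched in a few lines (hypothetical logits for a 3-class sentence classifier; plain softmax, not the paper's method):

```python
import math

def softmax(logits):
    """Convert raw scores into a posterior probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw logits from a 3-class classifier.
logits = [2.0, 1.0, 0.1]
posterior = softmax(logits)

# The usual decision rule keeps only the argmax...
predicted_class = max(range(len(posterior)), key=posterior.__getitem__)
# ...but the posterior itself (e.g. posterior[0] ~ 0.66) carries the
# graded confidence that calibration work cares about.
```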
Learning ULMFiT and Self-Distillation with Calibration for Medical Dialogue System
This paper investigates well-calibrated models based on ULMFiT and self-distillation (SD) for a medical dialogue system and empirically shows that the proposed methodologies outperform conventional methods in terms of accuracy and robustness.
Joint Energy-based Model Training for Better Calibrated Natural Language Understanding Models
This work explores joint energy-based model (EBM) training during the fine-tuning of pretrained text encoders for natural language understanding (NLU) tasks and shows that EBM training can help the model reach better calibration that is competitive with strong baselines, with little or no loss in accuracy.


Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration
A natively multiclass calibration method is proposed that is applicable to classifiers from any model class, derived from Dirichlet distributions and generalising the beta calibration method from binary classification.
On Calibration of Modern Neural Networks
It is discovered that modern neural networks, unlike those from a decade ago, are poorly calibrated, and that on most datasets temperature scaling -- a single-parameter variant of Platt scaling -- is surprisingly effective at calibrating predictions.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks. Expand
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge transfer across tasks.
Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers
It is concluded that binning succeeds in significantly improving naive Bayesian probability estimates, while for improving decision tree probability estimates they recommend smoothing by m-estimation and a new variant of pruning called curtailment.
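The binning idea referenced here, histogram binning, replaces each raw score with the empirical positive rate of its bin, estimated on held-out data. A minimal binary-classification sketch (equal-width bins; the fallback value for empty bins is an illustrative choice, not from the paper):

```python
def fit_histogram_binning(scores, labels, n_bins=10):
    """Map scores in [0, 1] to the empirical positive rate of their bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)
        bins[idx].append(y)
    # Calibrated value per bin: fraction of positives among held-out
    # examples that fell in the bin; empty bins fall back to the bin center.
    return [sum(b) / len(b) if b else (i + 0.5) / n_bins
            for i, b in enumerate(bins)]

def apply_binning(score, bin_values):
    """Calibrate a new score by looking up its bin's empirical rate."""
    n_bins = len(bin_values)
    return bin_values[min(int(score * n_bins), n_bins - 1)]

# Held-out scores and 0/1 labels (hypothetical).
scores = [0.95, 0.90, 0.92, 0.05, 0.12]
labels = [1, 0, 1, 0, 0]
bin_values = fit_histogram_binning(scores, labels)
```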
Posterior calibration and exploratory analysis for natural language processing models
It is argued that the quality of a model's posterior distribution can and should be directly evaluated, as to whether probabilities correspond to empirical frequencies, and that NLP uncertainty can be projected not only to pipeline components but also to exploratory data analysis, telling a user when to trust and not trust the NLP analysis.
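One standard way to check whether predicted probabilities correspond to empirical frequencies is expected calibration error (ECE): bin predictions by confidence and take the weighted average gap between each bin's accuracy and its mean confidence. A minimal sketch (equal-width bins; ECE is the common measure from the calibration literature, not this particular paper's evaluation):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average |accuracy - confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece, n = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# Overconfident model: says 95% but is right only half the time.
ece = expected_calibration_error([0.95, 0.95], [1, 0])
```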
The Importance of Calibration for Estimating Proportions from Annotations
This paper identifies and differentiates between two relevant data-generating scenarios (intrinsic vs. extrinsic labels), introduces a simple but novel method which emphasizes the importance of calibration, and analyzes and experimentally validates the appropriateness of various methods for each of the two scenarios.
Transforming classifier scores into accurate multiclass probability estimates
This work shows how to obtain accurate probability estimates for multiclass problems by combining calibrated binary probability estimates, and proposes a new method for obtaining calibrated two-class probability estimates that can be applied to any classifier that produces a ranking of examples.
Verified Uncertainty Calibration
The scaling-binning calibrator is introduced, which first fits a parametric function to reduce variance and then bins the function values to actually ensure calibration, and estimates a model's calibration error more accurately using an estimator from the meteorological community.
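The binning half of the scaling-binning calibrator can be sketched as follows: after a parametric function (e.g. temperature scaling) has produced scaled scores, each score is replaced by the average scaled score of its equal-mass bin, so the calibrator outputs only a small number of discrete values whose calibration can be verified. A minimal sketch of that second step only (the scaling step and the improved error estimator are omitted):

```python
def bin_scaled_scores(scaled_scores, n_bins=2):
    """Replace each scaled score with the mean of its equal-mass bin.

    Equal-mass binning: sort the scores and split them into n_bins
    groups of (roughly) equal size, so every bin has enough data to
    estimate its average reliably.
    """
    order = sorted(range(len(scaled_scores)), key=lambda i: scaled_scores[i])
    out = [0.0] * len(scaled_scores)
    size = len(scaled_scores) / n_bins
    for b in range(n_bins):
        idx = order[int(b * size):int((b + 1) * size)]
        avg = sum(scaled_scores[i] for i in idx) / len(idx)
        for i in idx:
            out[i] = avg
    return out

# Hypothetical outputs of a fitted scaling function.
calibrated = bin_scaled_scores([0.1, 0.2, 0.7, 0.8], n_bins=2)
```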
xSLUE: A Benchmark and Analysis Platform for Cross-Style Language Understanding and Evaluation
This paper provides a benchmark corpus (xSLUE) with an online platform (this http URL) for cross-style language understanding and evaluation and shows that some styles are highly dependent on each other, and some domains are stylistically more diverse than others.