• Corpus ID: 55767944

Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation

  • David M. W. Powers
Commonly used evaluation measures including Recall, Precision, F-Factor and Rand Accuracy are biased and should not be used without a clear understanding of the biases, and corresponding identification of chance or base case levels of the statistic. Using these measures, a system that performs worse in the objective sense of Informedness can appear to perform better under any of these commonly used measures. We discuss several concepts and measures that reflect the probability that prediction…
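As a minimal sketch of the quantities the abstract contrasts with Recall/Precision/F, the following computes Informedness, Markedness, and their geometric mean (the Matthews correlation) from a 2×2 confusion matrix, using the standard definitions rather than anything taken verbatim from the paper:

```python
import math

def dichotomous_measures(tp, fn, fp, tn):
    """Chance-corrected measures from a 2x2 confusion matrix (standard formulas)."""
    recall = tp / (tp + fn)              # true positive rate (sensitivity)
    inverse_recall = tn / (tn + fp)      # true negative rate (specificity)
    precision = tp / (tp + fp)
    inverse_precision = tn / (tn + fn)   # negative predictive value
    informedness = recall + inverse_recall - 1        # Youden's J
    markedness = precision + inverse_precision - 1
    # Informedness and Markedness share the sign of the table's determinant,
    # so the Matthews correlation is the signed geometric mean of the two.
    sign = 1.0 if informedness >= 0 else -1.0
    correlation = sign * math.sqrt(informedness * markedness)
    return informedness, markedness, correlation

# Illustrative counts only:
inf, mark, corr = dichotomous_measures(tp=45, fn=5, fp=10, tn=40)
```

With these counts, Informedness is 0.7 while Recall alone is 0.9 — the gap is exactly the chance/bias component the abstract warns about.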

Figures and Tables from this paper

The Problem with Kappa

It is shown that deploying a system in a context which has the opposite skew from its validation set can be expected to approximately negate Fleiss Kappa and halve Cohen Kappa but leave Powers Kappa unchanged, and the latter is thus most appropriate, whilst for comparison of behaviour, Matthews Correlation is recommended.
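A small sketch of the skew effect the snippet describes, using the standard Cohen kappa formula and taking "Powers Kappa" as Informedness (an assumption consistent with the paper above, not code from either paper): halving the negative class changes Cohen's kappa but leaves the rate-based measure untouched.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 table: (observed - chance) / (1 - chance) agreement."""
    n = tp + fn + fp + tn
    po = (tp + tn) / n  # observed agreement
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

def powers_kappa(tp, fn, fp, tn):
    """Informedness (sensitivity + specificity - 1): invariant to class skew."""
    return tp / (tp + fn) + tn / (tn + fp) - 1

balanced = (40, 10, 10, 40)   # illustrative counts
skewed = (40, 10, 5, 20)      # same rates, negative class halved
```

On the balanced table both measures agree (0.6); on the skewed table `powers_kappa` is still 0.6 while `cohen_kappa` drops, illustrating why the snippet prefers the latter measure under deployment skew.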

EPP: interpretable score of model predictive power

A new EPP rating system for predictive models is introduced, with numerous advantages. First, differences in EPP scores have a probabilistic interpretation, which can assess the probability that one model will achieve better performance than another and can be directly compared between datasets.

Deep ROC Analysis and AUC as Balanced Average Accuracy to Improve Model Selection, Understanding and Interpretation

Deep ROC analysis is proposed to measure performance in multiple groups of predicted risk (like calibration), or groups of true positive rate or false positive rate, and a new interpretation of AUC in whole or part, as balanced average accuracy, relevant to individuals instead of pairs is provided.
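The AUC quantity this snippet reinterprets can be sketched with the standard Mann-Whitney formulation — the probability that a randomly chosen positive is scored above a randomly chosen negative (the scores below are illustrative, not from the paper):

```python
def auc(scores_pos, scores_neg):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a win (Mann-Whitney U / (n_pos * n_neg))."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Perfect separation -> 1.0; all ties -> 0.5
auc([0.9, 0.8], [0.1, 0.2])
```

At any single threshold, balanced accuracy is (TPR + TNR) / 2; the paper's reading of AUC as a balanced average accuracy averages this kind of quantity over groups, which the pairwise form above makes concrete for the whole curve.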


To compare the adequacy of accuracy measures, a simulation study was performed under different scenarios; the results highlight the advantages and disadvantages of each procedure and recommend the use of the φ index.

Evaluation Gaps in Machine Learning Practice

The evaluation gaps between the idealized breadth of evaluation concerns and the observed narrow focus of actual evaluations are examined, pointing the way towards more contextualized evaluation methodologies for robustly examining the trustworthiness of ML models.

Model Comparison and Calibration Assessment: User Guide for Consistent Scoring Functions in Machine Learning and Actuarial Practice

This user guide revisits statistical techniques to assess the calibration or adequacy of a model on the one hand, and to compare and rank different models on the other hand, emphasising the importance of specifying the prediction target functional at hand a priori.

PToPI: A Comprehensive Review, Analysis, and Knowledge Representation of Binary Classification Performance Measures/Metrics

A new exploratory table called PToPI (Periodic Table of Performance Instruments) is introduced for 29 measures and 28 metrics; it provides a new relational structure for the instruments, including graphical, probabilistic, and entropic ones, to show their properties and dependencies all in one place.

Interpretable meta-score for model performance

A comparison based on Elo ranking is presented, which offers a probabilistic interpretation of how much better one model is than another, and a unified benchmark ontology is proposed that provides a uniform description of benchmarks.

Multilabel Classification with Partial Abstention: Bayes-Optimal Prediction under Label Independence

This paper studies an extension of the setting of MLC, in which the learner is allowed to partially abstain from a prediction, that is, to deliver predictions on some but not necessarily all class labels, and shows MLC with partial abstention to be effective in the sense of reducing loss when being allowed to abstain.

Distributed Optimization of Classifier Committee Hyperparameters

This paper takes the well-known classifiers K-Nearest Neighbors and Naive Bayes, where K (from KNN) and the a priori probabilities (from NB) are hyperparameters that influence accuracy.

Diversity of decision-making models and the measurement of interrater agreement.

In this article, diagnostic decision making is viewed as a special case of signal detection theory, where each diagnostic process is characterized by a function that relates the probability of a case receiving a positive diagnosis to the severity or salience of symptoms.

Rule Evaluation Measures: A Unifying View

This paper develops a unifying view on some of the existing measures for predictive and descriptive induction by means of contingency tables, and demonstrates that many rule evaluation measures developed for predictive knowledge discovery can be adapted to descriptive knowledge discovery tasks.

Calibration of p Values for Testing Precise Null Hypotheses

P values are the most commonly used tool to measure evidence against a hypothesis or hypothesized model. Unfortunately, they are often incorrectly viewed as an error probability for rejection of the

Assessing Agreement on Classification Tasks: The Kappa Statistic

This paper discusses what is wrong with reliability measures as they are currently used for discourse and dialogue work in computational linguistics and cognitive science, and argues that the field would be better off adopting techniques from content analysis.

Is Human Learning Rational?

  • D. Shanks
  • Psychology
    The Quarterly journal of experimental psychology. A, Human experimental psychology
  • 1995
It is argued that accurate judgements are an emergent property of an associationist learning process of the sort that has become common in adaptive network models of cognition and is the “means” to a normative or statistical “end”.

A Coefficient of Agreement for Nominal Scales

CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of

Improved likelihood ratio tests for complete contingency tables

SUMMARY Lawley (1956) describes how asymptotic likelihood ratio tests can in general be improved by multiplying the −2 log λ test statistic by a multiplier chosen so that the null distribution of the

Approximating the Moments and Distribution of the Likelihood Ratio Statistic for Multinomial Goodness of Fit

Abstract Approximations were derived for the mean and variance of G², the likelihood ratio statistic for testing goodness of fit in a k-cell multinomial distribution. These approximate moments,
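For reference, the G² statistic whose moments this paper approximates has the standard form G² = 2 Σ O ln(O/E) over the k cells; a minimal sketch with made-up counts (the approximations themselves are not reproduced here):

```python
import math

def g_squared(observed, expected):
    """Likelihood ratio goodness-of-fit statistic G^2 = 2 * sum(O * ln(O/E)).
    Cells with zero observed count contribute nothing to the sum."""
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

# Uniform null over two cells, illustrative counts:
g_squared([30, 70], [50, 50])
```

Under the null, G² is asymptotically chi-squared with k − 1 degrees of freedom, which is the distribution the paper's moment approximations refine for small samples.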