A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation

@inproceedings{Goutte2005API,
  title={A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation},
  author={Cyril Goutte and {\'E}ric Gaussier},
  booktitle={ECIR},
  year={2005}
}
We address the problems of 1/ assessing the confidence of the standard point estimates, precision, recall and F-score, and 2/ comparing the results, in terms of precision, recall and F-score, obtained using two different methods. To do so, we use a probabilistic setting which allows us to obtain posterior distributions on these performance indicators, rather than point estimates. This framework is applied to the case where different methods are run on different datasets from the same source, as… 
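
Below is a minimal sketch, not the authors' code, of the construction the abstract describes: with conjugate priors, the counts of a confusion matrix induce posterior distributions on precision, recall and F-score, which can be sampled via shared Gamma variates instead of being reported as single point estimates. The confusion-matrix counts and the prior pseudo-count `lam` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
tp, fp, fn = 40, 10, 15          # hypothetical confusion-matrix counts
lam = 0.5                        # symmetric prior pseudo-count (an assumption here)
n = 100_000

# Shared Gamma variates: u carries the true-positive count, so the sampled
# precision and recall stay coupled through it.
u = rng.gamma(tp + lam, size=n)
v = rng.gamma(fp + lam, size=n)
w = rng.gamma(fn + lam, size=n)

precision = u / (u + v)          # marginally Beta(tp + lam, fp + lam)
recall    = u / (u + w)          # marginally Beta(tp + lam, fn + lam)
f1        = 2 * u / (2 * u + v + w)

for name, samples in [("precision", precision), ("recall", recall), ("F1", f1)]:
    lo, hi = np.percentile(samples, [2.5, 97.5])
    print(f"{name:9s} mean={samples.mean():.3f}  95% CrI=[{lo:.3f}, {hi:.3f}]")
```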

Computing Precision and Recall with Missing or Uncertain Ground Truth

TLDR
A probabilistic interpretation of both measures is developed, and it is shown that, provided a sufficient number of data sources are available, this offers a viable way to compare methods when no ground truth is available.

Bayes Test of Precision, Recall, and F1 Measure for Comparison of Two Natural Language Processing Models

TLDR
This study proposes using block-regularized 3×2 cross-validation (3×2 BCV) for model comparison, together with a novel Bayes test that directly computes the probabilities of the hypotheses from the posterior distributions and provides more informative decisions than existing significance t-tests.
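
As a hedged illustration of the kind of output a Bayes test produces (this is not the 3×2 BCV procedure from the cited paper), the sketch below reuses the Gamma-variate posterior construction from above and reports the posterior probability that one model's F1 exceeds the other's, with made-up confusion counts.

```python
import numpy as np

rng = np.random.default_rng(1)

def f1_posterior(tp, fp, fn, lam=0.5, n=100_000):
    """Posterior F1 samples from confusion counts via shared Gamma variates."""
    u = rng.gamma(tp + lam, size=n)
    v = rng.gamma(fp + lam, size=n)
    w = rng.gamma(fn + lam, size=n)
    return 2 * u / (2 * u + v + w)

# Hypothetical confusion-matrix counts for two models on the same test set.
f1_a = f1_posterior(tp=40, fp=10, fn=15)
f1_b = f1_posterior(tp=36, fp=8, fn=19)
print("P(F1_A > F1_B) =", (f1_a > f1_b).mean())
```

Treating the two posteriors as independent ignores the correlation induced by evaluating both models on the same data, which is precisely the issue the block-regularized cross-validation design in the cited paper is meant to handle.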

Statistical inference on recall, precision and average precision under random selection

  • P. Zhang, Wanhua Su
  • Computer Science
    2012 9th International Conference on Fuzzy Systems and Knowledge Discovery
  • 2012
TLDR
A framework for statistical inference on recall, precision and average precision is developed by establishing their asymptotic properties; it can be applied in other areas where ranking systems need to be evaluated, such as information retrieval.

Credible Intervals for Precision and Recall Based on a K-Fold Cross-Validated Beta Distribution

TLDR
This study proposes two posterior credible intervals for precision and recall based on K-fold cross-validated beta distributions, constructed from the beta posterior distribution inferred from all K data sets, corresponding to the K confusion matrices produced by a K-fold cross-validation.
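
A hedged sketch of the general recipe, not the cited paper's exact estimator: pool per-fold confusion counts into a single Beta posterior for precision and read off an equal-tailed credible interval. The fold counts and the uniform Beta(1, 1) prior are assumptions for illustration.

```python
from scipy.stats import beta

# Per-fold (tp, fp) counts from a hypothetical 5-fold cross-validation.
fold_counts = [(18, 4), (22, 3), (19, 6), (21, 5), (20, 4)]
tp = sum(t for t, _ in fold_counts)
fp = sum(f for _, f in fold_counts)

a, b = tp + 1, fp + 1                    # Beta(1, 1) uniform prior (assumption)
lo, hi = beta.ppf([0.025, 0.975], a, b)  # equal-tailed 95% credible interval
print(f"precision: posterior mean {a / (a + b):.3f}, 95% CrI [{lo:.3f}, {hi:.3f}]")
```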

EPP: interpretable score of model predictive power

TLDR
A new EPP rating system for predictive models is introduced, with numerous advantages: differences in EPP scores have a probabilistic interpretation, allowing assessment of the probability that one model will outperform another, and EPP scores can be compared directly across datasets.

A Bayesian Hierarchical Model for Comparing Average F1 Scores

TLDR
A novel approach to explicitly modelling the uncertainty of average F1 scores through Bayesian reasoning is proposed, and it is demonstrated that this can provide a much more comprehensive performance comparison between text classifiers than traditional frequentist null hypothesis significance testing (NHST).

Confidence Interval for the Difference in Classification Error

TLDR
The use of Tango's biostatistical test is proposed to compute consistent confidence intervals on the difference in classification errors on both classes, motivated by the need for classifier evaluation at a confidence level suitable for medical studies.

Estimating the Uncertainty of Average F1 Scores

TLDR
A novel approach to explicitly modelling the uncertainty of average F1 scores through Bayesian reasoning is proposed.

The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets

TLDR
It is shown that the visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with respect to conclusions about the reliability of classification performance, owing to an intuitive but wrong interpretation of specificity.
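
The following sketch reproduces the cited paper's point on synthetic data (dataset parameters are arbitrary choices for the demonstration): with a 99:1 class imbalance, ROC AUC typically comes out much higher than average precision, the usual summary of the PR curve, so the ROC view alone can be misleading.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Heavily imbalanced synthetic problem: roughly 1% positives.
X, y = make_classification(n_samples=20_000, weights=[0.99, 0.01],
                           n_informative=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print("ROC AUC          :", round(roc_auc_score(y_te, scores), 3))
print("Average precision:", round(average_precision_score(y_te, scores), 3))
```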

Hard and Soft Evaluation of NLP models with BOOtSTrap SAmpling - BooStSa

TLDR
BooStSa is presented, a tool that makes it easy to compute significance levels with the BOOtSTrap SAmpling procedure when evaluating models that predict not only standard hard labels but also soft labels.
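
A minimal paired-bootstrap sketch in the spirit of such tools, not the BooStSa implementation itself: resample test items with replacement and estimate how often the apparently better system fails to beat the baseline on F1. The label arrays are synthetic stand-ins for real system outputs.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=500)                           # hypothetical gold labels
pred_a = np.where(rng.random(500) < 0.75, y_true, 1 - y_true)   # system A, ~75% agreement
pred_b = np.where(rng.random(500) < 0.80, y_true, 1 - y_true)   # system B, ~80% agreement

observed = f1_score(y_true, pred_b) - f1_score(y_true, pred_a)
worse_or_equal = 0
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))        # paired resample of items
    diff = f1_score(y_true[idx], pred_b[idx]) - f1_score(y_true[idx], pred_a[idx])
    worse_or_equal += diff <= 0
print(f"observed F1 gain = {observed:.3f}, bootstrap p(B <= A) = {worse_or_equal / 2000:.3f}")
```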
...

References


Using statistical testing in the evaluation of retrieval experiments

TLDR
It is suggested that relevance feedback be evaluated from the perspective of the user and a number of different statistical tests are described for determining if differences in performance between retrieval methods are significant.

A NEW MEASURE OF RETRIEVAL EFFECTIVENESS (OR: WHAT’S WRONG WITH PRECISION AND RECALL)

TLDR
ADM (Average Distance Measure) turns out to be both adequate for measuring the effectiveness of information retrieval systems and useful for revealing some problems with precision and recall.

More accurate tests for the statistical significance of result differences

TLDR
It is found in a set of experiments that many commonly used tests often underestimate significance and so are less likely to detect differences that exist between techniques; tests that avoid this problem, including computationally-intensive randomization tests, are pointed out.

Statistical inference in retrieval effectiveness evaluation

  • J. Savoy
  • Computer Science
    Inf. Process. Manag.
  • 1997

Cumulated gain-based evaluation of IR techniques

TLDR
This article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position, and test results indicate that the proposed measures credit IR methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences.

Significance tests for the evaluation of ranking methods

TLDR
A statistical model is presented that interprets the evaluation of ranking methods as a random experiment and predicts the variability of evaluation results, so that appropriate significance tests for the results can be derived.

The TREC 2002 Filtering Track Report

TLDR
This report describes the TREC 2002 filtering track, presents some evaluation results, and provides a general commentary on lessons learned from this year's track.

The jackknife, the bootstrap, and other resampling plans

Contents: The Jackknife Estimate of Bias; The Jackknife Estimate of Variance; Bias of the Jackknife Variance Estimate; The Bootstrap; The Infinitesimal Jackknife; The Delta Method and the Influence Function.

Bayesian inference in statistical analysis

TLDR
This chapter discusses the Bayesian assessment of assumptions, investigating the effect of non-normality on inferences about a population mean, with generalizations, in the context of a Bayesian inference model.

A Statistical Analysis of the TREC-3 Data

A statistical analysis of the TREC-3 data shows that performance differences across queries are greater than performance differences across participants' runs. Generally, groups of runs which do not…