Confusion Over Measures of Evidence (p's) Versus Errors (α's) in Classical Statistical Testing

@article{Hubbard2003ConfusionOM,
  title={Confusion Over Measures of Evidence (p's) Versus Errors ($\alpha$'s) in Classical Statistical Testing},
  author={Raymond Hubbard and Maria J. Bayarri},
  journal={The American Statistician},
  year={2003},
  volume={57},
  pages={171--178}
}
Confusion surrounding the reporting and interpretation of results of classical statistical tests is widespread among applied researchers, most of whom erroneously believe that such tests are prescribed by a single coherent theory of statistical inference. This is not the case: Classical statistical testing is an anonymous hybrid of the competing and frequently contradictory approaches formulated by R. A. Fisher on the one hand, and Jerzy Neyman and Egon Pearson on the other. In particular… 

The widespread misinterpretation of p-values as error probabilities

The anonymous mixing of Fisherian (p-values) and Neyman–Pearsonian (α levels) ideas about testing, distilled in the customary but misleading p < α criterion of statistical significance, has led
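The tension behind the customary p < α criterion can be made concrete with a small sketch. This example is not from the paper; it assumes a hypothetical standard normal test statistic (z = 2.3) and shows how the same number is read two different ways: Fisher's graded p-value as a measure of evidence versus the Neyman–Pearson binary decision at a pre-set α.

```python
# Illustrative sketch (not from the paper): one test statistic, two readings.
# Fisher: report the observed p-value as a graded measure of evidence.
# Neyman-Pearson: fix alpha in advance and report only reject / do not reject.
import math

def normal_sf(z):
    # Survival function of the standard normal via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def two_sided_p(z):
    # Two-sided p-value for an observed z-statistic.
    return 2.0 * normal_sf(abs(z))

z = 2.3  # hypothetical observed z-statistic
p = two_sided_p(z)

# Fisherian report: the p-value itself, a data-dependent quantity.
print(f"p = {p:.4f}")

# Neyman-Pearson report: a binary decision at a pre-specified error rate alpha,
# with no finer grading of the evidence.
alpha = 0.05
print("reject H0" if p < alpha else "do not reject H0")
```

The point of the contrast: p is computed from the data after the fact, while α is an error rate chosen before seeing any data; treating the observed p as if it were that error rate is exactly the conflation the paper describes.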

Null hypothesis significance tests. A mix-up of two different theories: the basis for widespread confusion and numerous misinterpretations

TLDR
The theoretical origins of NHST, which are mostly absent from standard statistical textbooks, are introduced to the scientometric community, and some of the most prevalent problems relating to the practice are discussed and traced back to the mix-up of the two different theoretical origins.

On Some Assumptions of the Null Hypothesis Statistical Testing

TLDR
This article presents the steps to compute s-values and, in order to illustrate the methods, analyzes some standard examples and compares them with p-values, showing that p-values, as opposed to s-values, fail to satisfy some logical relations.

A Decision-Theoretic Formulation of Fisher’s Approach to Testing

In Fisher’s interpretation of statistical testing, a test is seen as a ‘screening’ procedure; one either reports some scientific findings, or alternatively gives no firm conclusions. These choices

General Testing: Fisher, Neyman, Pearson, and Bayes

One of the famous controversies in statistics is the dispute between Fisher and Neyman-Pearson about the proper way to conduct a test. Hubbard and Bayarri (2003) gave an excellent account of the

Significance Testing Needs a Taxonomy

TLDR
Neyman and Pearson's approach to statistical analysis using alpha and beta error rates has played a dominant role in guiding inferential judgments, appropriately in highly determined situations and inappropriately in scientific exploration.

Statistical Inference as Severe Testing

TLDR
This book pulls back the cover on disagreements between experts charged with restoring integrity to science, and denies two pervasive views of the role of probability in inference: to assign degrees of belief, and to control error rates in a long run.

Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing

Reporting p values from statistical significance tests is common in psychology's empirical literature. Sir Ronald Fisher saw the p value as playing a useful role in knowledge development by acting as

Reverse-Bayes analysis of two common misinterpretations of significance tests

TLDR
Two common mistakes in the interpretation of statistical significance tests are shown to imply strong and often unrealistic assumptions on the prior proportion or probability of truly effective treatments.

The Role of p-Values in Judging the Strength of Evidence and Realistic Replication Expectations

Abstract p-Values are viewed by many as the root cause of the so-called replication crisis, which is characterized by the prevalence of positive scientific findings that are contradicted in
...

References

SHOWING 1-10 OF 59 REFERENCES

p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate.

  • S. Goodman
  • Psychology
    American journal of epidemiology
  • 1993
TLDR
An analysis using another method promoted by Fisher, mathematical likelihood, shows that the p value substantially overstates the evidence against the null hypothesis.
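Goodman's claim that the p-value overstates the evidence can be illustrated numerically. The sketch below is an assumption-laden paraphrase, not the paper's own computation: for a normal test statistic z and a point null, the likelihood of H0 relative to the best-supported alternative is bounded below by exp(-z²/2) (the minimum Bayes factor), which at the conventional z = 1.96 is far less extreme than "p = 0.05" suggests.

```python
# Hedged numeric sketch of the likelihood critique of p-values.
# For a point null with a normal test statistic z, the likelihood ratio
# L(H0)/L(H1) against the maximum-likelihood alternative is exp(-z**2/2),
# a lower bound on the Bayes factor over all alternatives.
import math

def two_sided_p(z):
    # Two-sided p-value: erfc(|z|/sqrt(2)) = 2 * P(Z > |z|).
    return math.erfc(abs(z) / math.sqrt(2.0))

def min_bayes_factor(z):
    # Minimum likelihood of H0 relative to the best-fitting alternative.
    return math.exp(-z * z / 2.0)

z = 1.96  # the conventional 5% two-sided cutoff
print(f"p      = {two_sided_p(z):.3f}")      # ~0.050
print(f"min BF = {min_bayes_factor(z):.3f}") # ~0.146
```

At z = 1.96 the minimum Bayes factor is about 0.15, i.e. the data favor the alternative over the null by at most roughly 7:1, considerably weaker than the "1 in 20" reading that the p-value invites.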

ON THE USE AND INTERPRETATION OF CERTAIN TEST CRITERIA FOR PURPOSES OF STATISTICAL INFERENCE PART I

In an earlier paper* we have endeavoured to emphasise the importance of placing in a logical sequence the stages of reasoning adopted in the solution of certain statistical problems, which may be

ON THE INTERPRETATION OF HYPOTHESIS TESTS FOLLOWING NEYMAN AND PEARSON

To begin with, Neyman and Pearson agreed with Fisher that the result in a hypothesis test is a measure of evidence. In their first joint paper, which was published in 1928, they declared that the

The Logic of Tests of Significance

TLDR
The goal of the paper is to describe precisely the pattern of inductive reasoning that is characteristic of Fisherian tests of significance, and to show that while it is far more cogent than Fisher's critics have realized, it does not logically sustain the inferences it sanctions.

Could Fisher, Jeffreys and Neyman Have Agreed on Testing?

Ronald Fisher advocated testing using p-values, Harold Jeffreys proposed use of objective posterior probabilities of hypotheses and Jerzy Neyman recommended testing with fixed error probabilities.

Tests of Significance in Theory and Practice

The best (most widely) received theory for tests of significance is that due largely to Fisher. Embellished with Neyman's mathematics, Fisher's theory is very well received. But Fisher's logic is not

Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence

Abstract The problem of testing a point null hypothesis (or a “small interval” null hypothesis) is considered. Of interest is the relationship between the P value (or observed significance level) and

P Values: What They are and What They are Not

Abstract P values (or significance probabilities) have been used in place of hypothesis tests as a means of giving more information about the relationship between the data and the hypothesis than

A comment on replication, p-values and evidence.

TLDR
It is shown that if the observed difference is the true one, the probability of repeating a statistically significant result, the 'replication probability', is substantially lower than expected.

Frequentist probability and frequentist statistics

TLDR
The stimulus is multiple: letters from friends calling my attention to a dispute in journal articles, in letters to editors, and in books, about what is described as 'the Neyman-Pearson school' and particularly what is described as Neyman's 'radical' objectivism.
...