Why We Don't Really Know What Statistical Significance Means: A Major Educational Failure

  • J. Scott Armstrong, Raymond Hubbard
  • The Wharton School
The Neyman-Pearson theory of hypothesis testing, with the Type I error rate, α, as the significance level, is widely regarded as statistical testing orthodoxy. Fisher’s model of significance testing, where the evidential p value denotes the level of significance, nevertheless dominates statistical testing practice. This paradox has occurred because these two incompatible theories of classical statistical testing have been anonymously mixed together, creating the false impression of a single… 
Significance Testing in Accounting Research: A Critical Evaluation Based on Evidence
From a survey of the papers published in leading accounting journals in 2014, we find that accounting researchers conduct significance testing almost exclusively at a conventional level of
A Decision-Theoretic Formulation of Fisher’s Approach to Testing
In Fisher’s interpretation of statistical testing, a test is seen as a ‘screening’ procedure; one either reports some scientific findings, or alternatively gives no firm conclusions. These choices
Tackling False Positives in Finance: A Statistical Toolbox With Applications
It is found that the positive results obtained under the p-value criterion cannot stand when the toolbox is applied to three notable studies in finance.
Tackling False Positives in Business Research: A Statistical Toolbox with Applications
  • Jae H. Kim
  • Business
    Journal of Economic Surveys
  • 2018
Serious concerns have been raised that false positive findings are widespread in empirical research in business disciplines. This is largely because researchers almost exclusively adopt the 'p-value
Should significance testing be abandoned in machine learning?
It is demonstrated that the Jeffreys–Lindley paradox can have severe repercussions for the comparison of multiple classifiers over multiple benchmark data sets, and suggests that significance tests should not be used in such comparative studies.
Confidence curves: an alternative to null hypothesis significance testing for the comparison of classifiers
  • D. Berrar
  • Computer Science
    Machine Learning
  • 2016
Confidence curves are proposed, which depict nested confidence intervals at all levels for the performance difference and enable us to assess the compatibility of an infinite number of null hypotheses with the experimental results.
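Berrar's confidence curve is, in essence, the two-sided p-value plotted as a function of the hypothesized performance difference: every value not rejected at level α lies inside the (1 − α) confidence interval. A minimal stdlib-only sketch, with an assumed (illustrative) performance difference and standard error:

```python
import math

diff, se = 0.03, 0.02   # assumed observed difference in accuracy and its standard error

def p_value(theta0):
    # Two-sided p-value for the hypothesis "true difference = theta0" under a
    # normal approximation; the confidence curve plots this against theta0.
    return math.erfc(abs(diff - theta0) / (se * math.sqrt(2.0)))

for theta0 in (0.0, 0.01, 0.03, 0.05):
    print(theta0, round(p_value(theta0), 3))
# Every theta0 with p_value(theta0) >= alpha lies inside the (1 - alpha) CI,
# so the curve summarizes intervals at all confidence levels at once.
```

The curve peaks at the observed difference (p = 1 there) and falls off symmetrically, showing which null hypotheses are compatible with the data rather than a single reject/accept verdict.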
The Undetectable Difference: An Experimental Look at the ‘Problem’ of p-Values
In the face of continuing assumptions by many scientists and journal editors that p-values provide a gold standard for inference, counter warnings are published periodically. But the core problem is
Research Commentary - Too Big to Fail: Large Samples and the p-Value Problem
This research commentary recommends a series of actions the researcher can take to mitigate the p-value problem in large samples and illustrates them with an example of over 300,000 camera sales on eBay.
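The large-sample p-value problem the commentary describes is easy to reproduce with a back-of-the-envelope calculation (the effect size and sample size below are illustrative, not figures from the eBay study):

```python
import math

def two_sided_p_from_z(z):
    # Two-sided p-value under the standard normal approximation.
    return math.erfc(abs(z) / math.sqrt(2.0))

d = 0.01                 # a negligible standardized mean difference (Cohen's d)
n_per_group = 300_000    # a "big data"-scale sample per group
z = d * math.sqrt(n_per_group / 2.0)   # z-statistic for a two-sample test
p = two_sided_p_from_z(z)
print(f"z = {z:.2f}, p = {p:.2e}")     # p far below 0.05 despite a trivial effect
```

With a practically meaningless effect (d = 0.01), the test is still "highly significant" once n is large enough, which is exactly why the commentary urges reporting effect sizes alongside p-values.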
On the Jeffreys-Lindley Paradox and the Looming Reproducibility Crisis in Machine Learning
  • D. Berrar, W. Dubitzky
  • Philosophy
    2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA)
  • 2017
This paradox describes a statistical conundrum where the frequentist and Bayesian interpretation are diametrically opposed and might lead to a situation that is similar to the current reproducibility crisis in other fields of science.
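The paradox can be reproduced numerically: hold the test statistic (and hence the p-value) fixed while the sample size grows, and the Bayes factor swings toward the null. A stdlib-only sketch for a unit-variance normal model with an assumed N(0, 1) prior on the effect under H1 (the numbers are illustrative):

```python
import math

def log_norm_pdf(x, var):
    # Log density of N(0, var) at x.
    return -0.5 * (math.log(2.0 * math.pi * var) + x * x / var)

def bf01(z, n, tau2=1.0, sigma2=1.0):
    # Bayes factor for H0: theta = 0 vs H1: theta ~ N(0, tau2), given a
    # sample mean xbar of n observations with z = xbar * sqrt(n / sigma2).
    xbar = z * math.sqrt(sigma2 / n)
    return math.exp(log_norm_pdf(xbar, sigma2 / n)
                    - log_norm_pdf(xbar, tau2 + sigma2 / n))

z = 2.5                              # fixed test statistic for every n
p = math.erfc(z / math.sqrt(2.0))    # two-sided p ~ 0.012, "significant"
for n in (100, 10_000, 1_000_000):
    print(n, p, bf01(z, n))
```

At n = 100 the Bayes factor mildly favors H1, but at n = 1,000,000 the same p ≈ 0.012 corresponds to strong evidence *for* the null — the frequentist and Bayesian readings point in opposite directions, which is the conundrum the paper connects to benchmark comparisons in machine learning.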


Confusion Over Measures of Evidence (p's) Versus Errors (α's) in Classical Statistical Testing
Confusion surrounding the reporting and interpretation of results of classical statistical tests is widespread among applied researchers, most of whom erroneously believe that such tests are
p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate.
  • S. Goodman
  • Psychology
    American journal of epidemiology
  • 1993
An analysis using another method promoted by Fisher, mathematical likelihood, shows that the p value substantially overstates the evidence against the null hypothesis.
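Goodman's point can be checked directly: for a normal test statistic z, the largest possible likelihood ratio against the null (taking the alternative exactly at the observed value) is exp(z²/2). At p = 0.05 that bound is only about 7, well short of the 20-to-1 odds a naive "1/p" reading suggests. A stdlib-only sketch:

```python
import math

def z_from_two_sided_p(p):
    # Invert p = erfc(z / sqrt(2)) by bisection (erfc is decreasing in z).
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if math.erfc(mid / math.sqrt(2.0)) > p:
            lo = mid   # p-value at mid too large: need a bigger z
        else:
            hi = mid
    return (lo + hi) / 2.0

p = 0.05
z = z_from_two_sided_p(p)
max_lr = math.exp(z * z / 2.0)   # best-case likelihood ratio against H0
print(f"z = {z:.3f}, max LR = {max_lr:.2f}")
```

So even under the most favorable alternative, p = 0.05 corresponds to evidence of at most about 6.8 to 1 — the sense in which the p-value "substantially overstates the evidence against the null hypothesis".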
The Significance of Statistical Significance Tests in Marketing Research
Classical statistical significance testing is the primary method by which marketing researchers empirically test hypotheses and draw inferences about theories. The authors discuss the interpretation
Statistical Methods and Scientific Induction
The attempt to reinterpret the common tests of significance used in scientific research as though they constituted some kind of acceptance procedure and led to "decisions" in Wald's sense,
Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers
Data analysis methods in psychology still emphasize statistical significance testing, despite numerous articles demonstrating its severe deficiencies. It is now possible to use meta-analysis to show
The null ritual : What you always wanted to know about significance testing but were afraid to ask
One of us once had a student who ran an experiment for his thesis. Let us call him Pogo. Pogo had an experimental group and a control group and found that the means of both groups were exactly the
Statistical Evidence: A Likelihood Paradigm
Although the likelihood paradigm has been around for some time, Royall's distinctive voice, combined with his contribution of several novel lines of argument, has given new impetus to a school of
The Power of Replications and Replications of Power
Abstract The purpose of this paper is to examine the impact of low statistical power on the process of research replication. The traditional model of interpreting the success of replication efforts
Are null results becoming an endangered species in marketing?
Editorial procedures in the social and biomedical sciences are said to promote studies that falsely reject the null hypothesis. This problem may also exist in major marketing journals. Of 692 papers