Redefine statistical significance

@article{benjamin2018redefine,
  title={Redefine statistical significance},
  author={Daniel J. Benjamin and James O. Berger and Magnus Johannesson and Brian A. Nosek and Eric-Jan Wagenmakers and Richard A. Berk and Kenneth A. Bollen and Bj{\"o}rn Brembs and Lawrence Brown and Colin Camerer and David Cesarini and Christopher D. Chambers and Merlise A. Clyde and Thomas D. Cook and Paul De Boeck and Zoltan Dienes and Anna Dreber and Kenny Easwaran and Charles Efferson and Ernst Fehr and Fiona Fidler and Andy P. Field and Malcolm Forster and Edward I. George and Richard Gonzalez and Steven Goodman and Edwin Green and Donald P. Green and Anthony G. Greenwald and Jarrod D. Hadfield and Larry V. Hedges and Leonhard Held and Teck Hua Ho and Herbert Hoijtink and Daniel J. Hruschka and Kosuke Imai and Guido Imbens and John P. A. Ioannidis and Mi-hye Jeon and James Holland Jones and Michael Kirchler and David I. Laibson and John A. List and R. Little and Arthur Lupia and Edouard Machery and Scott E. Maxwell and Michael Mccarthy and Don A. Moore and Stephen L. Morgan and Marcus Robert Munafo and Shinichi Nakagawa and Brendan Nyhan and Timothy H. Parker and Luis R. Pericchi and Marco Perugini and Jeffrey N. Rouder and Judith Rousseau and Victoria Savalei and Felix D. Sch{\"o}nbrodt and Thomas M. Sellke and Betsy Sinclair and Dustin Tingley and Trisha Van Zandt and Simine Vazire and Duncan J. Watts and Christopher Winship and Robert L. Wolpert and Yumeng Xie and Cristobal Young and Jonathan Zinman and Valen E. Johnson},
  journal={Nature Human Behaviour},
  year={2018}
}
We propose to change the default P-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries. 
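The arithmetic behind the proposal can be sketched numerically. The following is a minimal illustration (the 1:10 prior odds that a tested effect is real and the 80% power are assumed for illustration only, not taken from the paper) of how the false-positive report probability shrinks when the threshold moves from 0.05 to 0.005:

```python
# Illustrative sketch: false-positive report probability (FPRP), i.e.
# P(effect is null | result is significant), at two alpha thresholds.
# Prior odds (1:10) and power (80%) are assumed values, for illustration.

def fprp(alpha, power=0.80, prior_true=1 / 11):
    true_positives = power * prior_true          # real effects detected
    false_positives = alpha * (1 - prior_true)   # nulls crossing threshold
    return false_positives / (false_positives + true_positives)

print(f"alpha = 0.05:  FPRP ~ {fprp(0.05):.3f}")   # roughly 0.38
print(f"alpha = 0.005: FPRP ~ {fprp(0.005):.3f}")  # roughly 0.06
```

Under these assumed numbers, more than a third of "significant" findings at alpha = 0.05 would be false positives, versus about 6% at alpha = 0.005.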
‘One-size-fits-all’ threshold for P values under fire
Scientists hit back at a proposal to make it tougher to call findings statistically significant.
Justify your alpha
In response to recommendations to redefine statistical significance to P ≤ 0.005, we propose that researchers should transparently report and justify all choices they make when designing a study,
Correction to ‘The reproducibility of research and the misinterpretation of p-values’
Interval estimation, point estimation, and null hypothesis significance testing calibrated by an estimated posterior probability of the null hypothesis
  • D. Bickel, Communications in Statistics - Theory and Methods, 2021
Much of the blame for failed attempts to replicate reports of scientific findings has been placed on ubiquitous and persistent misinterpretations of the p value. An increasingly popular solution is...
The p‐value statement, five years on
The American Statistical Association's 2016 p‐value statement generated debates and disagreements, editorials and symposia, and a plethora of ideas for how science could be changed for the better.
Evaluation of Lowering the P Value Threshold for Statistical Significance From .05 to .005 in Previously Published Randomized Clinical Trials in Major Medical Journals
This study evaluates primary end points in randomized clinical trials (RCTs) published in 3 major general medical journals to determine how changing the P value threshold for statistical significance
The Impact of P-hacking on “Redefine Statistical Significance”
  • Harry Crane, Basic and Applied Social Psychology, 2018
Abstract In their proposal to “redefine statistical significance,” Benjamin et al. claim that lowering the default cutoff for statistical significance from .05 to .005 would “immediately improve the
Manipulating the Alpha Level Cannot Cure Significance Testing
We argue that making accept/reject decisions on scientific hypotheses, including a recent call for changing the canonical alpha level from p = 0.05 to p = 0.005, is deleterious for the finding of new
Threats of a replication crisis in empirical computer science
Research replication only works if there is confidence built into the results, and the results should be confidence-based.
Revised standards for statistical evidence
  • V. Johnson, Proceedings of the National Academy of Sciences, 2013
Modifications of common standards of evidence are proposed to reduce the rate of nonreproducibility of scientific research by a factor of 5 or greater and to correct the problem of unjustifiably high levels of significance.
On the Reproducibility of Psychological Science
The results of this reanalysis provide a compelling argument for both increasing the threshold required for declaring scientific discoveries and for adopting statistical summaries of evidence that account for the high proportion of tested hypotheses that are false.
Calibration of p Values for Testing Precise Null Hypotheses
P values are the most commonly used tool to measure evidence against a hypothesis or hypothesized model. Unfortunately, they are often incorrectly viewed as an error probability for rejection of the
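The calibration this entry refers to can be sketched with the well-known -e·p·ln(p) lower bound on the Bayes factor in favor of the null (valid for p < 1/e); the function names and the equal-prior-odds default below are my own illustrative choices:

```python
import math

def bayes_factor_bound(p):
    """Lower bound on the Bayes factor in favor of the null:
    -e * p * ln(p), valid for 0 < p < 1/e."""
    if not 0 < p < 1 / math.e:
        raise ValueError("bound requires 0 < p < 1/e")
    return -math.e * p * math.log(p)

def posterior_null_bound(p, prior_null=0.5):
    """Corresponding lower bound on P(H0 | data), assuming equal
    prior odds by default (an illustrative assumption)."""
    odds = (prior_null / (1 - prior_null)) * bayes_factor_bound(p)
    return odds / (1 + odds)

# A 'significant' p = 0.05 still leaves at least ~29% posterior
# probability on the null under equal prior odds:
print(round(posterior_null_bound(0.05), 3))   # -> 0.289
print(round(posterior_null_bound(0.005), 3))  # -> 0.067
```

This is the sense in which p = 0.05 is often weaker evidence than its face value suggests, and part of the case for the 0.005 threshold.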
Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature
In light of the findings, the recently reported low replication success in psychology is realistic; worse performance may be expected for cognitive neuroscience, and the false report probability is likely to exceed 50% for the whole literature.
Effect sizes and p values: what should be reported and what should be replicated?
The most-criticized flaws of NHT can be avoided when the importance of a hypothesis is used to determine that a finding is worthy of report, and when p approximately equal to .05 is treated as insufficient basis for confidence in the replicability of an isolated non-null finding.
Beyond Power Calculations
  • A. Gelman, J. Carlin, Perspectives on Psychological Science, 2014
The largest challenge in a design calculation, coming up with reasonable estimates of plausible effect sizes based on external information, is discussed; design calculations that report the probability of an estimate being in the wrong direction and the factor by which the magnitude of an effect might be overestimated are recommended.
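A design calculation of this kind can be sketched by simulation; the true effect and standard error below are invented purely for illustration:

```python
# Simulation sketch of a design calculation: for an assumed true effect
# and standard error, estimate power, the sign-error (Type S) rate, and
# the exaggeration (Type M) ratio among statistically significant results.
import random
import statistics

def retrodesign(true_effect, se, z=1.96, n_sims=100_000, seed=1):
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e) > z * se]
    power = len(significant) / n_sims
    type_s = sum(e * true_effect < 0 for e in significant) / len(significant)
    type_m = statistics.mean(abs(e) for e in significant) / abs(true_effect)
    return power, type_s, type_m

# Underpowered setting (true effect half the standard error): significant
# estimates exaggerate the effect severalfold and sometimes flip its sign.
power, type_s, type_m = retrodesign(true_effect=1.0, se=2.0)
```

In this underpowered scenario, power lands around 8%, a non-trivial share of significant estimates have the wrong sign, and the average significant estimate overstates the true effect by several times.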
Using prediction markets to estimate the reproducibility of scientific research
It is argued that prediction markets could be used to obtain speedy information about reproducibility at low cost and could potentially even be used to determine which studies to replicate to optimally allocate limited resources into replications.
The ASA Statement on p-Values: Context, Process, and Purpose
Cobb’s concern was a long-worrisome circularity in the sociology of science based on the use of bright lines such as p < 0.05: “We teach it because it’s what we do; we do it because it’s what we teach.”
Estimating the reproducibility of psychological science
A large-scale assessment suggests that experimental reproducibility in psychology leaves a lot to be desired, and correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
Evaluating replicability of laboratory experiments in economics
To contribute data about replicability in economics, 18 studies published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014 are replicated, finding that two-thirds of the 18 studies examined yielded replicable estimates of effect size and direction.