P-Values: Misunderstood and Misused

Bertie Vidgen and Taha Yasseri
P-values are widely used in both the social and natural sciences to quantify the statistical significance of observed results. The recent surge of big data research has made the p-value an even more popular tool to test the significance of a study. However, a substantial literature has been produced critiquing how p-values are used and understood. In this paper we review this recent critical literature, much of which is rooted in the life sciences, and consider its implications for social…


The use of p-values in applied research: Interpretation and new trends
Two alternative theoretical frameworks are reviewed: the use of the Bayes factor, and a recent proposal that evaluates statistical hypotheses in terms of a priori and a posteriori odds ratios.
Comments From the Editor: How Big Are Your P Values?
  B. Silvey, Update: Applications of Research in Music Education, 2021
For researchers who conduct quantitative analyses that involve statistical software such as SPSS or R, nothing is more perilous than hitting the execute button and waiting for those p values to…
Beyond psychology: prevalence of p value and confidence interval misinterpretation across different fields
P values and confidence intervals (CIs) are the most widely used statistical indices in scientific literature. Several surveys have revealed that these two indices are generally misunderstood.
Why, When and How to Adjust Your P Values?
What the P value represents, and why and when it should be adjusted, is presented, and how to adjust P values for multiple testing in the R environment for statistical computing is shown.
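The step-up procedure this entry refers to can be sketched in plain Python. R's `p.adjust(..., method = "BH")` is the reference implementation the entry points to; the `bh_adjust` function below is an illustrative pure-Python re-implementation of the Benjamini-Hochberg adjustment, not code from the paper itself:

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, as in R's p.adjust(method="BH")."""
    m = len(pvalues)
    # Sort p-values while remembering their original positions.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        q = min(prev, pvalues[i] * m / rank)
        adjusted[i] = q
        prev = q
    return adjusted

print([round(q, 4) for q in bh_adjust([0.01, 0.04, 0.03, 0.005])])
# → [0.02, 0.04, 0.04, 0.02]
```

The adjusted values agree with R's `p.adjust` on the same input; rejecting those below 0.05 controls the false discovery rate rather than the family-wise error rate.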
Problem with p values: why p values do not tell you if your treatment is likely to work
The attractions of a rule-based ‘algorithm’ approach are that it is easy to implement, permits binary decisions to be made and makes it simple for investigators, editors, readers and funding bodies to count or discount the work.
A Biased Review of Biases in Twitter Studies on Political Collective Action
A minireview of Twitter-based research on political crowd behavior considers a small number of selected papers, analyzes their (often lack of) theoretical approaches, reviews their methodological innovations, and offers suggestions as to the relevance of their results for political scientists and sociologists.
What, when and where of petitions submitted to the UK government during a time of chaos
The results show the huge power of computationally analysing petitions to understand not only what issues citizens are concerned about but also when petitions are submitted and where their signatories are geographically located.
Bayesian model averaging: improved variable selection for matched case-control studies.
Bayesian model averaging is a conceptually simple, unified approach that produces robust results and can be used to replace controversial P-values for case-control study in medical research.
When Communication Meets Computation: Opportunities, Challenges, and Pitfalls in Computational Communication Science
This special issue discusses the validity of using big data in communication science and showcases a number of new methods and applications in the fields of text and network analysis.
Draw-A-Science-Comic: Alternative prompts and the presence of danger
The early years of primary school are important in shaping how children see scientists and science, but researching younger children is known to be difficult. The Draw-A-Scientist Test (DAST), in…
References
To P or not to P: on the evidential nature of P-values and their place in scientific inference
It is shown that P-values quantify experimental evidence not by their numerical value, but through the likelihood functions that they index.
An investigation of the false discovery rate and the misinterpretation of p-values
It is concluded that if you wish to keep your false discovery rate below 5%, you need to use a three-sigma rule, or to insist on p≤0.001, and never use the word ‘significant’.
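The trade-off behind this recommendation can be made concrete with a small back-of-envelope calculation. The 10% prior probability and 80% power used below are illustrative assumptions for the sketch, not figures quoted from the paper:

```python
def false_discovery_rate(alpha, power, prior_true):
    """Expected fraction of 'significant' results that are false positives."""
    false_pos = alpha * (1 - prior_true)  # true nulls that pass the test
    true_pos = power * prior_true         # real effects that are detected
    return false_pos / (false_pos + true_pos)

# Compare the conventional 0.05 threshold with the stricter p <= 0.001 rule.
for alpha in (0.05, 0.001):
    fdr = false_discovery_rate(alpha, power=0.8, prior_true=0.1)
    print(f"alpha = {alpha}: FDR = {fdr:.1%}")
```

Under these assumptions the 0.05 threshold leaves roughly a third of "significant" findings false, while p ≤ 0.001 brings the rate close to 1%, which is the intuition behind the entry's three-sigma advice.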
Final Collapse of the Neyman-Pearson Decision Theoretic Framework and Rise of the neoFisherian
This essay grew out of an examination of one-tailed significance testing. One-tailed tests were little advocated by the founders of modern statistics but are widely used and recommended nowadays in…
Scientific method: Statistical errors
It turned out that the problem was not in the data or in Motyl's analyses; it lay in the surprisingly slippery nature of the P value, which is neither as reliable nor as objective as most scientists assume.
Sifting the evidence—what's wrong with significance tests?
The high volume and often contradictory nature of medical research findings, however, is not only because of publication bias, but also because of the widespread misunderstanding of the nature of statistical significance.
The Extent and Consequences of P-Hacking in Science
It is suggested that p-hacking probably does not drastically alter scientific consensuses drawn from meta-analyses, and its effect seems to be weak relative to the real effect sizes being measured.
Could Fisher, Jeffreys and Neyman Have Agreed on Testing?
Ronald Fisher advocated testing using p-values, Harold Jeffreys proposed use of objective posterior probabilities of hypotheses and Jerzy Neyman recommended testing with fixed error probabilities.
Publication bias in the social sciences: Unlocking the file drawer
Fully half of peer-reviewed and implemented social science experiments are not published, providing direct evidence of publication bias and identifying the stage of research production at which publication bias occurs: Authors do not write up and submit null findings.
Why Most Published Research Findings Are False
Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.
P Value and the Theory of Hypothesis Testing: An Explanation for New Researchers
The p value is the probability of obtaining an effect equal to or more extreme than the one observed, presuming the null hypothesis of no effect is true; it gives researchers a measure of the strength of evidence against the null hypothesis.
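The definition above can be made concrete with a minimal worked example: a two-sided exact binomial test of a fair coin that lands heads 15 times in 20 flips. The scenario and the function name are hypothetical, chosen only to illustrate "as or more extreme than the one observed":

```python
from math import comb

def two_sided_binom_p(k, n, p0=0.5):
    """P(an outcome at least as improbable as k heads) under the null p = p0."""
    probs = [comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(n + 1)]
    observed = probs[k]
    # Sum over all outcomes no more probable than the one observed.
    return sum(p for p in probs if p <= observed + 1e-12)

print(round(two_sided_binom_p(15, 20), 4))  # → 0.0414
```

At the conventional 0.05 threshold this result would be declared "significant", yet, as several of the papers above stress, the p-value of 0.0414 is not the probability that the coin is fair.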