Detecting and avoiding likely false‐positive findings – a practical guide

@article{Forstmeier2017DetectingAA,
  title={Detecting and avoiding likely false‐positive findings – a practical guide},
  author={Wolfgang Forstmeier and Eric-Jan Wagenmakers and Timothy H. Parker},
  journal={Biological Reviews},
  year={2017},
  volume={92}
}
Recently there has been a growing concern that many published research findings do not hold up in attempts to replicate them. We argue that this problem may originate from a culture of ‘you can publish if you found a significant effect’. This culture creates a systematic bias against the null hypothesis which renders meta‐analyses questionable and may even lead to a situation where hypotheses become difficult to falsify. In order to pinpoint the sources of error and possible solutions, we… 
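The filtering bias the authors describe is easy to see in a simulation. The sketch below is illustrative only, not from the paper: the true effect size, per-group sample size, and the "publish only if p < 0.05 in the predicted direction" filter are all assumptions. It shows how the mean effect across the "published" studies overstates the true effect that an unfiltered meta-analysis would recover.

```python
# Illustrative sketch: how a "publish only if significant" filter biases
# the aggregate effect estimate. All parameters are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_effect = 0.2    # small true standardized effect (assumed)
n_per_group = 20     # small samples, i.e. low power (assumed)
n_studies = 10_000

published, all_estimates = [], []
for _ in range(n_studies):
    treat = rng.normal(true_effect, 1.0, n_per_group)
    ctrl = rng.normal(0.0, 1.0, n_per_group)
    p = stats.ttest_ind(treat, ctrl).pvalue
    d = treat.mean() - ctrl.mean()   # effect estimate in SD units
    all_estimates.append(d)
    if p < 0.05 and d > 0:           # the significance filter
        published.append(d)

print(f"true effect:                {true_effect:.2f}")
print(f"mean over all studies:      {np.mean(all_estimates):.2f}")
print(f"mean over 'published' ones: {np.mean(published):.2f}")
# The published-only mean comes out severalfold too large here, which is
# why a meta-analysis of a filtered literature overestimates the effect.
```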
Perturbations on the uniform distribution of p-values can lead to misleading inferences from null-hypothesis testing
Null-hypothesis testing (NHT) based on statistical significance is the most conventional statistical framework on which neuroscientists rely for the analysis of their data. However, this…
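For context on this entry: when the null hypothesis is true and the test's assumptions hold, p-values are uniformly distributed on [0, 1], and that uniformity is exactly what selection and misspecification perturb. A minimal check by simulation (all parameters are illustrative assumptions):

```python
# Minimal check that p-values are uniform under a true null hypothesis.
# Parameters are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pvals = []
for _ in range(20_000):
    a = rng.normal(0, 1, 30)   # two samples from the SAME distribution
    b = rng.normal(0, 1, 30)
    pvals.append(stats.ttest_ind(a, b).pvalue)

pvals = np.asarray(pvals)
# Under H0, each decile should contain ~10% of the p-values.
hist, _ = np.histogram(pvals, bins=10, range=(0, 1))
print(hist / len(pvals))       # roughly [0.1, 0.1, ..., 0.1]
print((pvals < 0.05).mean())   # roughly 0.05, the nominal alpha
```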
Evidence that nonsignificant results are sometimes preferred: Reverse P-hacking or selective reporting?
TLDR
If researchers less often report significant findings and/or reverse P-hack to avoid significant outcomes that would undermine the ethos that experimental and control groups differ only with respect to actively manipulated variables, then significant results from tests for group differences should be under-represented in the literature.
The statistical significance filter leads to overoptimistic expectations of replicability
It is well-known in statistics (e.g., Gelman & Carlin, 2014) that treating a result as publishable just because the p-value is less than 0.05 leads to overoptimistic expectations of…
Do p Values Lose Their Meaning in Exploratory Analyses? It Depends How You Define the Familywise Error Rate
Several researchers have recently argued that p values lose their meaning in exploratory analyses due to an unknown inflation of the alpha level (e.g., Nosek & Lakens, 2014; Wagenmakers, 2016). For…
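The alpha inflation at issue has a simple closed form when the tests are independent: across m tests, each at level α, the familywise error rate is 1 − (1 − α)^m. A quick sketch of that arithmetic, together with the Bonferroni-corrected per-test level (independence is an assumption; dependent tests inflate less):

```python
# Familywise error rate for m independent tests at per-test level alpha:
#   FWER = 1 - (1 - alpha)^m
# plus the Bonferroni-corrected per-test level that restores a 5% FWER.
alpha = 0.05
for m in (1, 5, 10, 20):
    fwer = 1 - (1 - alpha) ** m
    print(f"m={m:2d}  FWER={fwer:.3f}  Bonferroni alpha={alpha / m:.4f}")
# m= 1  FWER=0.050 ...
# m=20  FWER=0.642 ...
```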
Modern statistics, multiple testing and wishful thinking
G. Byrnes · Psychology, Medicine · Occupational and Environmental Medicine · 2018
TLDR
An article in this issue by Lenters et al. [1] uses simulation to address some questions which should be well understood in the epidemiology community, but sadly are not.
Multiplicity Eludes Peer Review: The Case of COVID-19 Research
TLDR
An exploratory analysis of the Web of Science database for COVID-19 observational studies concludes that special attention must be paid to the increased chance of false discoveries in observational studies, including non-replicated striking discoveries with a potentially large social impact.
In Search of the Significant p. Its Influence on the Credibility of Publications
Publishing study results in a peer-reviewed journal represents the ultimate goal of research in any field of science, and it is obviously assumed that the results are correct and supported by a…
Does preregistration improve the credibility of research findings?
Preregistration entails researchers registering their planned research hypotheses, methods, and analyses in a time-stamped document before they undertake their data collection and analyses. This…
Paths Explored, Paths Omitted, Paths Obscured: Decision Points & Selective Reporting in End-to-End Data Analysis
TLDR
This study pores over nine published research studies and conducts semi-structured interviews with their authors, confirming that researchers may experiment with choices in search of desirable results, but also identifying other reasons why researchers explore alternatives yet omit findings.
Statistical Significance Testing at CHI PLAY: Challenges and Opportunities for More Transparency
TLDR
It is found that over half of the surveyed CHI PLAY papers employ NHST without specific statistical hypotheses or research questions, which may risk the proliferation of false-positive findings; a template for more transparent research and reporting practices is provided.

References

Showing 1–10 of 158 references
The natural selection of bad science
TLDR
A 60-year meta-analysis of statistical power in the behavioural sciences is presented and it is shown that power has not improved despite repeated demonstrations of the necessity of increasing power, and that replication slows but does not stop the process of methodological deterioration.
Significance chasing in research practice: causes, consequences and possible solutions.
TLDR
Significance chasing, questionable research practices and poor study reproducibility are the unfortunate consequence of a 'publish or perish' culture and a preference among journals for novel findings.
Publication Bias: The "File-Drawer" Problem in Scientific Inference
Publication bias arises whenever the probability that a study is published depends on the statistical significance of its results. This bias, often called the file-drawer effect since the unpublished…
Why Most Published Research Findings Are False
TLDR
Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.
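The engine of Ioannidis's argument is a positive-predictive-value calculation: with pre-study odds R that a probed relationship is true, type I error rate α, and power 1 − β, the probability that a significant finding is true is PPV = (1 − β)R / ((1 − β)R + α). A direct transcription, with example numbers that are assumptions rather than values from the paper:

```python
# Positive predictive value of a "significant" finding:
#   PPV = (1 - beta) * R / ((1 - beta) * R + alpha)
# where R is the pre-study odds that the tested relationship is true.
def ppv(R: float, alpha: float = 0.05, power: float = 0.8) -> float:
    return power * R / (power * R + alpha)

# Example numbers (assumed): an exploratory field with 1 true relationship
# per 10 tested and typical low power of 0.35, versus a well-powered
# confirmatory test of a 50:50 hypothesis.
print(f"{ppv(R=0.1, power=0.35):.2f}")  # ~0.41: more likely false than true
print(f"{ppv(R=1.0, power=0.80):.2f}")  # ~0.94
```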
False-Positive Psychology
TLDR
It is shown that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings, flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates, and a simple, low-cost, and straightforwardly effective disclosure-based solution is suggested.
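The paper's central point, that analytic flexibility inflates the false-positive rate, can be reproduced in spirit with a single researcher degree of freedom: measure two correlated outcomes and count a "finding" if either test is significant. The sample size, correlation, and other settings below are assumptions, not the paper's own simulation parameters.

```python
# Sketch of one researcher degree of freedom: two correlated DVs under a
# true null, reporting a success if EITHER reaches p < .05. Assumed setup.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, rho, reps = 20, 0.5, 10_000
cov = [[1, rho], [rho, 1]]

hits = 0
for _ in range(reps):
    treat = rng.multivariate_normal([0, 0], cov, n)  # no true effect
    ctrl = rng.multivariate_normal([0, 0], cov, n)
    p1 = stats.ttest_ind(treat[:, 0], ctrl[:, 0]).pvalue
    p2 = stats.ttest_ind(treat[:, 1], ctrl[:, 1]).pvalue
    hits += (p1 < 0.05) or (p2 < 0.05)

print(hits / reps)  # ~0.08-0.09 instead of the nominal 0.05
```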
Consequences of Prejudice Against the Null Hypothesis
The consequences of prejudice against accepting the null hypothesis were examined through (a) a mathematical model intended to simulate the research-publication process and (b) case studies of…
An Agenda for Purely Confirmatory Research
TLDR
This article proposes that researchers preregister their studies and indicate in advance the analyses they intend to conduct; only these analyses deserve the label “confirmatory,” and only for these analyses are the common statistical tests valid.
Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner's curse
TLDR
Full-model tests and P-value adjustments can be used as a guide to how frequently type I errors arise by sampling variation alone; the authors favour the presentation of full models, since these best reflect the range of predictors investigated and ensure a balanced representation of non-significant results as well.
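The "cryptic" multiplicity here is that a linear model with k predictors runs k coefficient tests at once. A sketch under a global null (all settings assumed for illustration) shows how often at least one coefficient appears significant by chance alone, and how the full-model test the authors recommend stays near the nominal level:

```python
# Under a global null, a linear model with many predictors often shows at
# least one "significant" coefficient by chance, while the full-model
# F test stays near the nominal level. All parameters are assumptions.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, k, reps = 100, 10, 2_000
any_coef_sig = full_model_sig = 0

for _ in range(reps):
    X = sm.add_constant(rng.normal(size=(n, k)))
    y = rng.normal(size=n)            # y is unrelated to every predictor
    fit = sm.OLS(y, X).fit()
    any_coef_sig += (fit.pvalues[1:] < 0.05).any()  # skip the intercept
    full_model_sig += fit.f_pvalue < 0.05

print(f"any coefficient 'significant': {any_coef_sig / reps:.2f}")   # ~0.40
print(f"full-model F test significant: {full_model_sig / reps:.2f}") # ~0.05
```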
Estimating the reproducibility of psychological science
TLDR
A large-scale assessment suggests that experimental reproducibility in psychology leaves a lot to be desired, and correlational tests suggest that replication success was better predicted by the strength of original evidence than by characteristics of the original and replication teams.
p-Curve and p-Hacking in Observational Research
TLDR
The p-curve for observational research in the presence of p-hacking is analyzed, and it is shown that even with minimal omitted-variable bias (e.g., unaccounted confounding), p-curves based on true effects and p-curves based on null effects with p-hacking cannot be reliably distinguished.
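A p-curve is simply the distribution of statistically significant p-values, binned below .05; right skew is the classic signature of true effects, and the abstract's point is that p-hacked null effects can mimic it in observational data. A minimal construction (effect size, sample size, and binning are all assumptions):

```python
# Minimal p-curve: bin the significant p-values (p < .05) from a set of
# studies. Right-skewed mass near zero is the usual signature of true
# effects. Effect size and sample sizes below are assumed.
import numpy as np
from scipy import stats

def p_curve(pvals, n_bins=5):
    sig = [p for p in pvals if p < 0.05]
    shares, edges = np.histogram(sig, bins=n_bins, range=(0, 0.05))
    return shares / len(sig), edges

rng = np.random.default_rng(3)
pvals = []
for _ in range(5_000):
    a = rng.normal(0.5, 1, 30)   # true effect d = 0.5 (assumed)
    b = rng.normal(0.0, 1, 30)
    pvals.append(stats.ttest_ind(a, b).pvalue)

shares, _ = p_curve(pvals)
print(np.round(shares, 2))  # heavily right-skewed, most mass in (0, .01)
```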