With Little Power Comes Great Responsibility

@inproceedings{Card2020WithLP,
  title={With Little Power Comes Great Responsibility},
  author={D. Card and Peter Henderson and Urvashi Khandelwal and Robin Jia and Kyle Mahowald and Dan Jurafsky},
  booktitle={EMNLP},
  year={2020}
}
Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of…
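
To make the definition concrete, power can be estimated by simulation: fix a plausible effect size and test-set size, repeatedly simulate the experiment, and count how often the significance test rejects the null. The sketch below is a minimal illustration, not the paper's own analysis; the accuracies, the test-set size, and the choice of an exact sign test over the examples where the two systems disagree are all illustrative assumptions.

import numpy as np
from scipy.stats import binomtest

def estimated_power(acc_a=0.85, acc_b=0.87, n_examples=2000,
                    alpha=0.05, n_sims=2000, seed=0):
    """Estimate the power of a paired exact sign test to detect a real
    accuracy gap between two systems, by repeated simulation.
    Correctness is drawn independently for the two systems, a simplifying
    assumption (real systems' errors are usually correlated)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        correct_a = rng.random(n_examples) < acc_a
        correct_b = rng.random(n_examples) < acc_b
        b_only = int(np.sum(correct_b & ~correct_a))
        a_only = int(np.sum(correct_a & ~correct_b))
        discordant = a_only + b_only
        if discordant == 0:
            continue
        # Under the null of no real difference, which system wins each
        # discordant example is a fair coin flip.
        p = binomtest(b_only, discordant, 0.5, alternative="two-sided").pvalue
        if p < alpha:
            rejections += 1
    return rejections / n_sims

print(f"estimated power: {estimated_power():.2f}")

Rerunning with a smaller test set or a smaller accuracy gap shows how quickly the rejection rate falls, which is the kind of shortfall the paper is concerned with.
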
Citations

The statistical advantage of automatic NLG metrics at the system level
TLDR
This paper qualifies the notion that automatic metrics are not as good as humans at estimating system-level quality by applying a bias-variance-noise decomposition (its generic form is sketched at the end of this list), and compares the adjusted error of metrics to that of humans and of a derived, perfect segment-level annotator, both of which are unbiased estimators whose error depends on the number of judgments collected.
How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation
TLDR
This paper conducts two evaluation experiments on two aspects of summaries’ linguistic quality to compare Likert-type and ranking annotations, and shows that the best choice of evaluation method can vary from one aspect to another.
Understanding Human Potentials for Evaluating Generative Models
TLDR
Focusing on natural language generation, this paper proposes a method to dynamically determine the number of human annotations required when evaluating models in a relative comparison setting, ensuring sufficient labelling to identify the optimal of two generative models with high probability.
Is Automated Topic Model Evaluation Broken?: The Incoherence of Coherence
TLDR
A meta-analysis of the topic modeling literature reveals a substantial standardization gap in automated topic modeling benchmarks, and the paper systematically evaluates a dominant classical model and two state-of-the-art neural models on two commonly used datasets.
All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text
TLDR
This paper examines the role untrained human evaluators play in NLG evaluation and explores three approaches for quickly training evaluators to better identify GPT-3-authored text, finding that while evaluation accuracy improved up to 55%, it did not significantly improve across the three domains.
FLEX: Unifying Evaluation for Few-Shot NLP
TLDR
The FLEX Principles are formulated: a set of requirements and best practices for unified, rigorous, valid, and cost-sensitive few-shot NLP evaluation, including Sample Size Design, a novel approach to benchmark design that optimizes statistical accuracy and precision while keeping evaluation costs manageable.
The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation
TLDR
It is shown that AMT worker judgments improve when workers are shown model-generated output alongside human-generated references, which enables them to better calibrate their ratings; interviews with English teachers provide deeper insights into the challenges of the evaluation process.
Underreporting of errors in NLG output, and what to do about it
TLDR
There is a severe under-reporting of the different kinds of errors that Natural Language Generation systems make, and this position paper provides recommendations for error identification, analysis and reporting.
CLIPScore: A Reference-free Evaluation Metric for Image Captioning
TLDR
This paper reports the surprising empirical finding that CLIP, a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references.
Scarecrow: A Framework for Scrutinizing Machine Text
TLDR
This work introduces SCARECROW, a new structured, crowdsourced error annotation schema that covers the error phenomena found in real machine-generated text, and collects annotations for text generated by state-of-the-art systems with varying known performance levels.
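
For reference, the bias-variance-noise decomposition mentioned in the first entry of this list has the textbook form below; the notation is generic and not necessarily the cited paper's. Here \hat{q} is a quality estimate computed from a finite sample of judgments, q is the observed (noisy) target, and \bar{q} = \mathbb{E}[q] is the true quality; the decomposition assumes the estimate and the noise in the target are independent:

\mathbb{E}\big[(\hat{q}-q)^2\big] \;=\; \underbrace{\big(\mathbb{E}[\hat{q}]-\bar{q}\big)^2}_{\text{bias}^2} \;+\; \underbrace{\operatorname{Var}(\hat{q})}_{\text{variance}} \;+\; \underbrace{\operatorname{Var}(q)}_{\text{noise}}
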

References

SHOWING 1-10 OF 89 REFERENCES
The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing
TLDR
This opinion/theoretical paper proposes a simple practical protocol for statistical significance test selection in NLP setups and accompanies this protocol with a brief survey of the most relevant tests.
An Empirical Investigation of Statistical Significance in NLP
TLDR
Two aspects of the empirical behavior of paired significance tests for NLP systems are investigated: when one system appears to outperform another, and, once significance levels are computed, how well the standard i.i.d. notion of significance holds up in practical settings where future distributions are neither independent nor identically distributed.
How many subjects? Statistical power analysis in research
TLDR
This stylishly slim volume presents the authors' method of sample size estimation for a wide variety of statistical procedures, along with much useful instruction and many insights into general experimental design.
Show Your Work: Improved Reporting of Experimental Results
TLDR
It is demonstrated that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best, and a novel technique is presented: expected validation performance of the best-found model as a function of computation budget.
Not All Claims are Created Equal: Choosing the Right Statistical Approach to Assess Hypotheses
TLDR
It is argued that practitioners should first decide their target hypothesis before choosing an assessment method, and best practices and guidelines tailored to NLP research are provided, as well as an easy-to-use package for Bayesian assessment of hypotheses, complementing existing tools.
On Some Pitfalls in Automatic Evaluation and Significance Testing for MT
TLDR
In an experimental comparison of two statistical significance tests, it is shown that p-values are estimated more conservatively by approximate randomization than by bootstrap tests, thus increasing the likelihood of type-I error for the latter (a minimal sketch of both tests appears after this reference list).
Random effects structure for confirmatory hypothesis testing: Keep it maximal.
TLDR
It is argued that researchers using LMEMs for confirmatory hypothesis testing should minimally adhere to the standards that have been in place for many decades, and it is shown that LMEMs generalize best when they include the maximal random effects structure justified by the design.
What’s in a p-value in NLP?
TLDR
It is shown that significance results following current research standards are unreliable and, in addition, very sensitive to sample size, to covariates such as sentence length, and to the existence of multiple metrics.
Statistical power in two-level models: A tutorial based on Monte Carlo simulation.
TLDR
This hands-on tutorial illustrates how a priori and post hoc power analyses for the most frequently used two-level models are conducted, and provides case-sensitive rules of thumb for deriving sufficient sample sizes as well as minimum detectable effect sizes that yield a power ≥ .80 for the effects and input parameters most frequently analyzed by psychologists.
Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli.
TLDR
It is shown that in crossed designs, statistical power typically does not approach unity as the number of participants goes to infinity but instead approaches a maximum attainable power value that is possibly small, depending on the stimulus sample.
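
As noted in the entry on pitfalls in significance testing for MT above, bootstrap and approximate randomization tests can give different p-values for the same paired comparison. The sketch below is a minimal illustration of the two tests applied to paired per-example metric scores; it is not the cited paper's implementation, and for corpus-level metrics such as BLEU each resample would instead have to recompute the metric over the resampled example set.

import numpy as np

def approx_randomization_p(scores_a, scores_b, n_trials=10000, seed=0):
    """Two-sided paired approximate-randomization test: randomly flip which
    system each example's score is attributed to and compare the permuted
    mean differences against the observed one."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    observed = abs(diffs.mean())
    hits = 0
    for _ in range(n_trials):
        signs = rng.choice([-1.0, 1.0], size=diffs.shape)
        if abs((signs * diffs).mean()) >= observed:
            hits += 1
    return (hits + 1) / (n_trials + 1)

def paired_bootstrap_p(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired bootstrap test: resample examples with replacement,
    center the bootstrap distribution of the mean difference on zero, and
    compare it against the observed mean difference."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    n = len(diffs)
    boot_means = np.array([diffs[rng.integers(0, n, n)].mean()
                           for _ in range(n_resamples)])
    centered = boot_means - diffs.mean()
    p = np.mean(np.abs(centered) >= abs(diffs.mean()))
    return max(p, 1.0 / n_resamples)

# Example usage with hypothetical per-example scores for two systems:
rng = np.random.default_rng(1)
scores_a = rng.normal(0.52, 0.1, size=500)
scores_b = rng.normal(0.50, 0.1, size=500)
print(approx_randomization_p(scores_a, scores_b))
print(paired_bootstrap_p(scores_a, scores_b))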