Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets

@article{Dror2017ReplicabilityAF,
  title={Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets},
  author={Rotem Dror and Gili Baumer and Marina Bogomolov and Roi Reichart},
  journal={Transactions of the Association for Computational Linguistics},
  year={2017},
  volume={5},
  pages={471-486}
}
With the ever-growing amount of textual data from a large variety of languages, domains, and genres, it has become standard to evaluate NLP algorithms on multiple datasets in order to ensure consistent performance across heterogeneous setups. However, such multiple comparisons pose significant challenges to traditional statistical analysis methods in NLP and can lead to erroneous conclusions. In this paper we propose a Replicability Analysis framework for a statistically sound analysis of…
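The counting procedure at the heart of the paper, partial conjunction testing, estimates how many of the n datasets show a genuine effect. Below is a minimal sketch of the Bonferroni-based variant, assuming per-dataset p-values are already computed; the function name and the example p-values are illustrative, not from the paper.

import numpy as np

def replicability_count(pvalues, alpha=0.05):
    """Estimate how many of n datasets show a genuine effect, via
    Bonferroni-based partial conjunction p-values (Benjamini & Heller, 2008):
    p^{u/n} = (n - u + 1) * p_(u), where p_(u) is the u-th smallest
    dataset-level p-value."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = len(p)
    pc = (n - np.arange(1, n + 1) + 1) * p            # p^{u/n} for u = 1..n
    pc = np.minimum(np.maximum.accumulate(pc), 1.0)   # enforce monotonicity in u
    return int((pc <= alpha).sum())                   # = max{u : p^{u/n} <= alpha}

# Illustrative p-values from comparing two systems on five datasets:
# the estimate below is 2, i.e. the improvement replicates on two datasets.
print(replicability_count([0.001, 0.004, 0.02, 0.30, 0.45]))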
Citations

Community Perspective on Replicability in Natural Language Processing
A survey investigates how the NLP community perceives the topic of replicability in general, confirming earlier observations that successful reproducibility requires more than access to code and data.
Causal Inference in Natural Language Processing: Estimation, Prediction, Interpretation and Beyond
A fundamental goal of scientific research is to learn about causal relationships. However, despite its critical role in the life and social sciences, causality has not had the same importance in…
CogniVal: A Framework for Cognitive Word Embedding Evaluation
This paper presents the first multi-modal framework for evaluating English word representations based on cognitive lexical semantics, finding strong correlations between cognitive datasets, across recording modalities, and with performance on extrinsic NLP tasks.
The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing
This opinion/theoretical paper proposes a simple practical protocol for selecting statistical significance tests in NLP setups, and accompanies the protocol with a brief survey of the most relevant tests.
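As an illustration of the kind of test such a protocol might select when per-example scores are not normally distributed, here is a sketch of a paired permutation (approximate randomization) test; names and defaults are placeholders, not from that paper.

import numpy as np

def paired_permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Two-sided paired permutation (approximate randomization) test:
    randomly swap the two systems' scores per example and check how often
    the permuted mean difference is at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    hits = 0
    for _ in range(n_permutations):
        signs = rng.choice([-1, 1], size=len(diffs))  # per-example label swap
        if abs((signs * diffs).mean()) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)          # smoothed p-value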
Verification, Reproduction and Replication of NLP Experiments: a Case Study on Parsing Universal Dependencies
As in any field of inquiry that depends on experiments, the verifiability of experimental studies is important in computational linguistics. Despite increased attention to verification of empirical…
On the Choice of Auxiliary Languages for Improved Sequence Tagging
It is shown that attention-based meta-embeddings can effectively combine pre-trained embeddings from different languages for sequence tagging, setting new state-of-the-art results for part-of-speech tagging in five languages.
Show Your Work: Improved Reporting of Experimental Results
It is demonstrated that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best, and a novel technique is presented: the expected validation performance of the best-found model as a function of computation budget.
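The estimator behind that technique has a simple closed form: given n validation scores from random hyperparameter search, the expected maximum over a budget of k draws can be read off the empirical distribution. A hedged sketch follows; the scores and function name are illustrative.

import numpy as np

def expected_max_performance(val_scores, budget):
    """Expected validation score of the best of `budget` random trials,
    estimated from n observed scores: with the scores sorted ascending,
    P(max of k draws <= v_(i)) = (i/n)^k, so
    E[max] = sum_i v_(i) * ((i/n)^k - ((i-1)/n)^k)."""
    v = np.sort(np.asarray(val_scores, dtype=float))
    n = len(v)
    i = np.arange(1, n + 1)
    weights = (i / n) ** budget - ((i - 1) / n) ** budget
    return float((v * weights).sum())

# Illustrative: report the whole budget curve, not a single number.
scores = [0.71, 0.74, 0.69, 0.78, 0.73, 0.75, 0.72, 0.76]
for k in (1, 2, 4, 8):
    print(k, round(expected_max_performance(scores, k), 4))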
Flamingos and Hedgehogs in the Croquet-Ground: Teaching Evaluation of NLP Systems for Undergraduate Students
The course Evaluation of NLP Systems was a discussion-based seminar covering different aspects of evaluation in NLP: paradigms, common procedures, data annotation, metrics and measurements, statistical significance testing, best practices, and common approaches in specific NLP tasks and applications.
We Need to Talk about Standard Splits
It is argued that randomly generated splits should be used in system evaluation; replication and reproduction experiments with nine part-of-speech taggers published between 2000 and 2018 fail to reliably reproduce some rankings when the analysis is repeated with randomly generated training-test splits.
Bayes Test of Precision, Recall, and F1 Measure for Comparison of Two Natural Language Processing Models
This study proposes a block-regularized 3×2 cross-validation (3×2 BCV) for model comparison, along with a novel Bayes test that directly computes the probabilities of the hypotheses from their posterior distributions and provides more informative decisions than existing significance tests.

References

Showing 1-10 of 80 references.
An Empirical Investigation of Statistical Significance in NLP
Two aspects of the empirical behavior of paired significance tests for NLP systems are investigated: what it means when one system appears to outperform another, and, once significance levels are computed, how well the standard i.i.d. notion of significance holds up in practical settings where future distributions are neither independent nor identically distributed.
Estimating effect size across datasets
This short paper argues that, in order to assess the robustness of NLP tools, they need to be evaluated on diverse samples, and considers the problem of finding the most appropriate way to estimate the true effect size of systems over their baselines across datasets.
Polyglot: Distributed Word Representations for Multilingual NLP
This work quantitatively demonstrates the utility of word embeddings by using them as the sole features for training a part-of-speech tagger for a subset of these languages, and investigates the semantic features captured through the proximity of word groupings.
Replication issues in syntax-based aspect extraction for opinion mining
An empirical replicability study of three well-known algorithms for syntax-centric aspect-based opinion mining is introduced, showing that reproducing results continues to be a difficult endeavor, mainly due to the lack of detail regarding preprocessing and parameter settings.
Towards Robust Linguistic Analysis using OntoNotes
An analysis of the performance of publicly available, state-of-the-art tools on all layers and languages in the OntoNotes v5.0 corpus; it should set the benchmark for future development of NLP components in syntax and semantics, and possibly encourage research towards an integrated system that uses the various layers jointly to improve overall performance.
Replicability of Research in Biomedical Natural Language Processing: a pilot evaluation for a coding task
While all results were ultimately replicated, the systems were rated poorly by analysts on documentation aspects such as "ease of understanding system requirements" and "provision of information while the system is running".
What’s in a p-value in NLP?
It is shown that significance results following current research standards are unreliable and, in addition, highly sensitive to sample size, to covariates such as sentence length, and to the existence of multiple metrics.
OntoNotes: A Large Training Corpus for Enhanced Processing
This paper describes a large multilingual, richly annotated corpus that is being made available to the community. There is an emphasis on quality and consistency, with inter-annotator agreement rates…
Mimicking Word Embeddings using Subword RNNs
MIMICK is presented, an approach to generating OOV word embeddings compositionally by learning a function from spellings to distributional embeddings, with learning performed at the type level of the original word-embedding corpus.
CoNLL-X Shared Task on Multilingual Dependency Parsing
This work describes how treebanks for 13 languages were converted into the same dependency format and how parsing performance was measured, and draws general conclusions about multilingual parsing.