Significance tests for the evaluation of ranking methods

@inproceedings{evert2004significance,
  title={Significance tests for the evaluation of ranking methods},
  author={Stefan Evert},
  booktitle={COLING},
  year={2004}
}
  • S. Evert
  • Published in COLING 23 August 2004
  • Computer Science, Linguistics
This paper presents a statistical model that interprets the evaluation of ranking methods as a random experiment. This model predicts the variability of evaluation results, so that appropriate significance tests for the results can be derived. The paper concludes with an empirical validation of the model on a collocation extraction task. 
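
The abstract treats the number of true positives among the n-best candidates of a ranking method as the outcome of a random experiment, so its variability can be quantified. As a minimal sketch of this binomial view (not the paper's own derivation; the function name and the counts below are purely illustrative), a Wilson score confidence interval for the precision of an n-best list can be computed:

```python
import math

def wilson_interval(tp, n, z=1.96):
    """Wilson score confidence interval for the precision of an n-best
    list with tp true positives, treating tp as Binomial(n, p)."""
    p = tp / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical example: 220 true collocations among the 500-best candidates
lo, hi = wilson_interval(220, 500)
print(f"precision = {220 / 500:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

The width of such an interval is exactly the kind of variability of evaluation results that the paper's model is designed to predict.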

A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation

A probabilistic setting is used which allows us to obtain posterior distributions on these performance indicators, rather than point estimates, and is applied to the case where different methods are run on different datasets from the same source.
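
The posterior view described above can be sketched with a conjugate Beta model: starting from a uniform prior, tp true positives and fp false positives yield a Beta(tp+1, fp+1) posterior over precision. The function and the counts below are hypothetical illustrations, not results from the paper:

```python
import random

def precision_posterior(tp, fp, draws=10000, seed=0):
    """Monte Carlo draws from the Beta(tp + 1, fp + 1) posterior over
    precision, i.e. a uniform prior updated with tp hits and fp misses."""
    rng = random.Random(seed)
    return [rng.betavariate(tp + 1, fp + 1) for _ in range(draws)]

# Hypothetical counts: method A (80 TP, 20 FP) vs. method B (70 TP, 30 FP)
a = precision_posterior(80, 20, seed=0)
b = precision_posterior(70, 30, seed=1)
p_a_better = sum(x > y for x, y in zip(a, b)) / len(a)
print(f"P(precision_A > precision_B) = {p_a_better:.2f}")
```

This yields a full posterior probability that one method outperforms the other, rather than a single point estimate of the difference.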

A quantitative evaluation of keyword measures for corpus-based discourse analysis

We evaluated the automatic identification of keywords in our target corpus of German press texts on multiresistant pathogens, a recurring and highly controversial topic in public discourse (1.3M …)

Not All Claims are Created Equal: Choosing the Right Statistical Approach to Assess Hypotheses

It is argued that practitioners should first decide their target hypothesis before choosing an assessment method, and best practices and guidelines tailored to NLP research are provided, as well as an easy-to-use package for Bayesian assessment of hypotheses, complementing existing tools.

Syntax-Based Extraction

Cross-language evaluation shows that, despite the inherent errors and the challenges posed by the analysis of large amounts of unrestricted text, deep parsing contributes to a significant increase in performance.

A Weighted Density-Based Approach for Identifying Standardized Items that are Significantly Related to the Biological Literature

The significance of relationships between textual data and information that is represented in standardized ontologies and protein domains is evaluated using a density-based approach that integrates a weighting system to account for many-to-many relationships.

Lexical affinities and language applications

The thesis develops two approaches for computing lexical affinity, and proposes two new point estimators for co-occurrence and evaluates the measures and the estimation procedures with synonym questions.

Computing Word Similarity and Identifying Cognates with Pair Hidden Markov Models

A system for computing similarity between pairs of words is presented, based on Pair Hidden Markov Models, which have been used successfully for the alignment of biological sequences; it outperforms previously proposed techniques.

Feature selection in multiword expression recognition

The Statistics of Word Cooccurrences: Word Pairs and Collocations

More accurate tests for the statistical significance of result differences

It is found in a set of experiments that many commonly used tests often underestimate significance and are therefore less likely than computationally intensive randomization tests to detect differences that exist between techniques.
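
An approximate randomization test by stratified shuffling, of the kind discussed in this entry, can be sketched as follows; the per-item scores below are hypothetical:

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Approximate randomization test for the difference in mean
    per-item score between two systems evaluated on the same items."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    hits = 0
    for _ in range(trials):
        sa = sb = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # randomly swap the systems' outputs
                a, b = b, a
            sa += a
            sb += b
        if abs(sa - sb) / n >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # smoothed two-sided p-value

# Hypothetical per-item accuracies for two systems on 20 test items
a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
b = [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1]
print(f"p = {randomization_test(a, b):.3f}")
```

Because the test shuffles outputs item by item, it respects the pairing of the two systems on the same test data, which parametric tests often ignore.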

Methods for the Qualitative Evaluation of Lexical Association Measures

This paper presents methods for a qualitative, unbiased comparison of lexical association measures and the results obtained, and shows how estimates for the very large number of hapax legomena and double occurrences can be inferred from random samples.

Testing Statistical Hypotheses

This classic textbook, now available from Springer, summarizes developments in the field of hypothesis testing. Optimality considerations continue to provide the organizing principle.

Accurate Methods for the Statistics of Surprise and Coincidence

The basis of a measure based on likelihood ratios that can be applied to the analysis of text is described, and in cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical.
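
The likelihood-ratio statistic described in this entry can be sketched for a 2x2 contingency table of co-occurrence counts; the counts below are hypothetical, and the resulting statistic is compared against a chi-squared distribution with one degree of freedom:

```python
import math

def log_likelihood_ratio(o11, o12, o21, o22):
    """G-squared statistic for a 2x2 contingency table, comparing the
    observed counts with the expected counts under independence."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    g2 = 0.0
    for obs, exp in (
        (o11, rows[0] * cols[0] / n),
        (o12, rows[0] * cols[1] / n),
        (o21, rows[1] * cols[0] / n),
        (o22, rows[1] * cols[1] / n),
    ):
        if obs > 0:  # a zero cell contributes nothing to the sum
            g2 += obs * math.log(obs / exp)
    return 2 * g2

# Hypothetical co-occurrence counts for a word pair in a corpus
print(f"G2 = {log_likelihood_ratio(30, 1000, 2000, 500000):.1f}")
```

Unlike Pearson's chi-squared statistic, this measure remains usable for the skewed, sparse counts typical of word co-occurrence data, which is the motivation of the paper above.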

Can we do better than frequency? A case study on extracting PP-verb collocations

We argue that lexical association measures (AMs) should be evaluated against a reference set of collocations manually extracted from the full candidate data.

Foundations of statistical natural language processing

This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.

Using Statistics in Lexical Analysis

The computational tools available for studying machine-readable corpora are at present still rather primitive; these corpora, together with the basic concordancing tool mentioned above, are used to fill in detailed syntactic descriptions, prompting a move towards more thorough descriptions of lexical syntax.

Text Analysis Meets Computational Lexicography

A recursive chunker for unrestricted German text is presented within the framework of the IMS Corpus Workbench (CWB); it considers all of the additional information needed for this task, such as head lemma, morpho-syntactic information, and lexical or semantic properties, which are useful if not necessary for extraction processes.

Off-line (and on-line) text analysis for computational lexicography

The research reported on in this thesis was based on work in two projects: the DFG project Deutsches Referenzkorpus, a joint project of the Institut für Maschinelle Sprachverarbeitung (IMS) in Stuttgart, and the DFG-Transferbereich project Automatische Exzerption, which is intended to support the transfer of know-how from universities to companies.

The Usual Suspects: Data-Oriented Models for the Identification and Representation of Lexical Collocations

  • 2000