Beyond kappa: A review of interrater agreement measures

@article{banerjee_beyond_kappa,
  title={Beyond kappa: A review of interrater agreement measures},
  author={M. Banerjee and Michelle Hopkins Capozzoli and Laura McSweeney and Debajyoti Sinha},
  journal={Canadian Journal of Statistics / Revue Canadienne de Statistique}
}
In 1960, Cohen introduced the kappa coefficient to measure chance-corrected nominal-scale agreement between two raters. Since then, numerous extensions and generalizations of this interrater agreement measure have been proposed in the literature. This paper reviews and critiques various approaches to the study of interrater agreement, for which the relevant data comprise either nominal or ordinal categorical ratings from multiple raters. It presents a comprehensive compilation of the main …
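Cohen's kappa contrasts the observed agreement between two raters with the agreement expected by chance from the marginal totals of their contingency table. A minimal sketch in Python (the function name and example counts are illustrative, not from the paper):

```python
def cohens_kappa(table):
    """Cohen's (1960) kappa for a two-rater contingency table,
    given as a list of rows; cell [i][j] counts subjects placed in
    category i by rater 1 and category j by rater 2."""
    q = len(table)
    n = sum(sum(row) for row in table)
    po = sum(table[k][k] for k in range(q)) / n            # observed agreement
    row = [sum(table[i]) / n for i in range(q)]            # rater 1 marginals
    col = [sum(table[i][j] for i in range(q)) / n for j in range(q)]  # rater 2 marginals
    pe = sum(row[k] * col[k] for k in range(q))            # chance agreement
    return (po - pe) / (1 - pe)

cohens_kappa([[20, 5], [10, 15]])  # ≈ 0.40 (po = 0.70, pe = 0.50)
```

A value of 1 indicates perfect agreement, 0 agreement at chance level, and negative values agreement below chance.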

Tables from this paper

Assessing agreement between raters from the point of coefficients and loglinear models
Abstract: In square contingency tables, analysis of agreement between the row and column classifications is of interest. For nominal categories, the kappa coefficient is used to summarize the degree of …
Some Statistical Aspects of Measuring Agreement Based on a Modified Kappa
The focus of this paper is statistical inference for the problem of assessing agreement or disagreement between two raters who employ measurements on a two-level nominal scale. The purpose of this …
On the Equivalence of Multirater Kappas Based on 2-Agreement and 3-Agreement with Binary Scores
Cohen’s kappa is a popular descriptive statistic for summarizing agreement between the classifications of two raters on a nominal scale. With multiple raters, there are several views in the literature on how …
Computing inter-rater reliability and its variance in the presence of high agreement.
  • K. Gwet
  • Medicine, Mathematics
  • The British journal of mathematical and statistical psychology
  • 2008
This paper explores the origin of these limitations, introduces an alternative and more stable agreement coefficient referred to as the AC1 coefficient, and proposes new variance estimators for the multiple-rater generalized pi and AC1 statistics, whose validity does not depend upon the hypothesis of independence between raters.
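Gwet's AC1 replaces kappa's chance-agreement term with one built from the mean of the two raters' marginal probabilities, which keeps the coefficient stable when agreement is high but the categories are unbalanced. A sketch for the two-rater case (the formula is transcribed as an assumption; verify the details against Gwet's paper):

```python
def gwet_ac1(table):
    """Gwet's AC1 for a two-rater contingency table (list of rows).
    NOTE: chance-agreement formula assumed from Gwet (2008);
    verify against the paper before relying on it."""
    q = len(table)
    n = sum(sum(row) for row in table)
    pa = sum(table[k][k] for k in range(q)) / n            # observed agreement
    row = [sum(table[i]) / n for i in range(q)]
    col = [sum(table[i][j] for i in range(q)) / n for j in range(q)]
    pi = [(row[k] + col[k]) / 2 for k in range(q)]         # mean marginal probabilities
    pe = sum(p * (1 - p) for p in pi) / (q - 1)            # chance-agreement term
    return (pa - pe) / (1 - pe)
```

On a highly skewed table such as [[45, 2], [3, 0]], observed agreement is 0.90 and AC1 stays near 0.89, whereas Cohen's kappa falls near zero — the high-agreement instability this entry addresses.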
Meta-analysis of Cohen’s kappa
  • Shuyan Sun
  • Psychology
  • Health Services and Outcomes Research Methodology
  • 2011
Cohen’s κ is the most important and most widely accepted measure of inter-rater reliability when the outcome of interest is measured on a nominal scale. The estimates of Cohen’s κ usually vary from …
Assessing the inter-rater agreement for ordinal data through weighted indexes
A modification of Fleiss’ kappa that is not affected by the paradoxes is proposed and subsequently generalized to the case of ordinal variables, which extends the use of s* to a bivariate case.
Equivalences of weighted kappas for multiple raters
Abstract: Cohen’s unweighted kappa and weighted kappa are popular descriptive statistics for measuring agreement between two raters on a categorical scale. With m ≥ 3 raters, there are several views …
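For the m ≥ 3 case, the most common baseline is Fleiss' (1971) multirater kappa, which works from per-subject category counts rather than a pairwise table. A minimal sketch (variable names are illustrative):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a subjects-by-categories count matrix:
    counts[i][j] = number of raters who put subject i in category j.
    Every subject must receive the same number of ratings m."""
    n = len(counts)                      # number of subjects
    m = sum(counts[0])                   # ratings per subject
    q = len(counts[0])                   # number of categories
    # overall proportion of ratings falling in each category
    p = [sum(row[j] for row in counts) / (n * m) for j in range(q)]
    # per-subject agreement among the m raters
    P = [(sum(c * c for c in row) - m) / (m * (m - 1)) for row in counts]
    p_bar = sum(P) / n                   # mean observed agreement
    p_e = sum(pj * pj for pj in p)       # chance agreement
    return (p_bar - p_e) / (1 - p_e)
```

As with Cohen's kappa, 1 means perfect agreement and values near 0 mean agreement no better than chance.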
Bayesian inference for kappa from single and multiple studies.
A Bayesian analysis for kappa that can be routinely implemented using Markov chain Monte Carlo methodology is described, and extensive simulation is carried out to compare the performance of the Bayesian and frequentist tests.
The paper compares four coefficients that can be used to summarize inter-rater agreement on a nominal scale. The coefficients are Cohen's kappa and three coefficients that were originally proposed by …
Statistical description of interrater variability in ordinal ratings
  • J. Nelson, M. Pepe
  • Computer Science, Medicine
  • Statistical methods in medical research
  • 2000
A new graphical approach to describing interrater variability is presented; it involves a simple frequency-distribution display of the category probabilities and provides a simple visual summary of the rating data.


The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability
… weighted kappa (Spitzer, Cohen, Fleiss and Endicott, 1967; Cohen, 1968a). Kappa is the proportion of agreement corrected for chance, scaled to vary from −1 to +1, so that a negative value …
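Weighted kappa penalizes disagreements by how far apart the two ordinal categories are; with quadratic weights it coincides with an intraclass correlation coefficient, which is the equivalence this entry concerns. A minimal sketch with quadratic disagreement weights (names illustrative):

```python
def weighted_kappa(table):
    """Cohen's (1968) weighted kappa with quadratic disagreement
    weights, for a two-rater table over ordered categories."""
    q = len(table)
    n = sum(sum(row) for row in table)
    row = [sum(table[i]) / n for i in range(q)]
    col = [sum(table[i][j] for i in range(q)) / n for j in range(q)]
    # disagreement weight grows with squared distance between categories
    v = [[(i - j) ** 2 for j in range(q)] for i in range(q)]
    d_obs = sum(v[i][j] * table[i][j] / n for i in range(q) for j in range(q))
    d_exp = sum(v[i][j] * row[i] * col[j] for i in range(q) for j in range(q))
    return 1 - d_obs / d_exp
```

For a 2×2 table the only off-diagonal weight is 1, so this reduces to unweighted kappa.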
Measuring interrater reliability among multiple raters: an example of methods for nominal data.
Modifications of previously published estimators, appropriate for measuring reliability under stratified sampling frames, are introduced, and the measures are interpreted in light of standard errors computed using the jackknife.
Comparison of the Null Distributions of Weighted Kappa and the C Ordinal Statistic
It frequently occurs in psychological research that an investigator is interested in assessing the extent of interrater agreement when the data are measured on an ordinal scale. This Monte Carlo …
Bias, prevalence and kappa.
New indices that provide independent measures of bias and prevalence, as well as of observed agreement, are defined, and a simple formula is derived that expresses kappa in terms of these three indices.
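For a 2×2 table the three quantities — observed agreement plus a prevalence index and a bias index — can be computed directly from the cell counts. A sketch assuming the usual definitions from this literature (prevalence index |a − d|/n, bias index |b − c|/n; treat these as assumptions to check against the paper):

```python
def agreement_indices(a, b, c, d):
    """Observed agreement, prevalence index, and bias index for a
    2x2 agreement table [[a, b], [c, d]], where a and d are the two
    agreement cells. Definitions assumed, not quoted from the paper."""
    n = a + b + c + d
    po = (a + d) / n                 # observed agreement
    prevalence = abs(a - d) / n      # imbalance between the agreement cells
    bias = abs(b - c) / n            # imbalance between the disagreement cells
    return po, prevalence, bias
```

High prevalence or bias is exactly the situation in which kappa can look paradoxically low despite high observed agreement.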
Another look at interrater agreement.
  • R. Zwick
  • Medicine
  • Psychological bulletin
  • 1988
Consideration of the properties of three chance-corrected measures of inter-rater agreement leads to the recommendation that a test of marginal homogeneity be conducted as a first step in the assessment of rater agreement.
Extension of the kappa coefficient.
An extension of the kappa coefficient is proposed which is appropriate for use with multiple observations per subject and multiple response choices per observation, and which illustrates new approaches to difficult problems in the evaluation of reliability.
Assessing interrater agreement from dependent data.
This work investigates the use of a latent model proposed by Qu, Piedmonte, and Medendorp (1995) to estimate the correlation between raters for each method, and tests for their equality.
Measurement of interrater agreement with adjustment for covariates.
The kappa coefficient measures chance-corrected agreement between two observers in the dichotomous classification of subjects and assumes both raters have the same marginal probability of classification, but this probability may depend on one or more covariates.
Modelling patterns of agreement and disagreement
  • A. Agresti
  • Mathematics, Medicine
  • Statistical methods in medical research
  • 1992
A survey of ways of statistically modelling patterns of observer agreement and disagreement is presented, with the main emphasis on modelling inter-observer agreement for categorical responses, both for nominal and ordinal response scales.
Large sample standard errors of kappa and weighted kappa.
The statistics kappa (Cohen, 1960) and weighted kappa (Cohen, 1968) were introduced to provide coefficients of agreement between two raters for nominal scales. Kappa is appropriate when all …