# Beyond kappa: A review of interrater agreement measures

```bibtex
@article{Banerjee1999BeyondKA,
  title   = {Beyond kappa: A review of interrater agreement measures},
  author  = {M. Banerjee and Michelle Hopkins Capozzoli and Laura McSweeney and Debajyoti Sinha},
  journal = {Canadian Journal of Statistics / Revue Canadienne de Statistique},
  year    = {1999},
  volume  = {27},
  pages   = {3--23}
}
```

In 1960, Cohen introduced the kappa coefficient to measure chance-corrected nominal scale agreement between two raters. Since then, numerous extensions and generalizations of this interrater agreement measure have been proposed in the literature. This paper reviews and critiques various approaches to the study of interrater agreement, for which the relevant data comprise either nominal or ordinal categorical ratings from multiple raters. It presents a comprehensive compilation of the main…
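Cohen's chance-corrected agreement can be sketched concisely. The helper below is illustrative (the function name and use of NumPy are our choices, not the paper's): it computes κ = (p_o − p_e)/(1 − p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected by chance from each rater's marginal distribution.

```python
import numpy as np

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected nominal-scale agreement between two raters (Cohen, 1960).

    kappa = (p_o - p_e) / (1 - p_e), undefined when p_e = 1.
    """
    a = np.asarray(ratings_a)
    b = np.asarray(ratings_b)
    categories = np.union1d(a, b)
    n = len(a)
    # k x k joint classification table as proportions.
    table = np.array([[np.sum((a == i) & (b == j)) for j in categories]
                      for i in categories]) / n
    p_o = np.trace(table)                        # observed agreement
    p_e = table.sum(axis=1) @ table.sum(axis=0)  # chance agreement from marginals
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1, chance-level agreement yields κ = 0, and systematic disagreement drives κ toward −1.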

#### 820 Citations

Assessing agreement between raters from the point of coefficients and loglinear models

- 2017

Abstract: In square contingency tables, analysis of agreement between row and column classifications is of interest. For nominal categories, the kappa coefficient is used to summarize the degree of…

Some Statistical Aspects of Measuring Agreement Based on a Modified Kappa

- Mathematics
- 2009

The focus of this paper is the statistical inference of the problem of assessing agreement or disagreement between two raters who employ measurements on a two-level nominal scale. The purpose of this…

On the Equivalence of Multirater Kappas Based on 2-Agreement and 3-Agreement with Binary Scores

- Mathematics
- 2012

Cohen’s kappa is a popular descriptive statistic for summarizing agreement between the classifications of two raters on a nominal scale. With raters there are several views in the literature on how…

Computing inter-rater reliability and its variance in the presence of high agreement.

- Medicine, Mathematics
- The British journal of mathematical and statistical psychology
- 2008

This paper explores the origin of the limitations of existing agreement coefficients, introduces an alternative and more stable agreement coefficient referred to as AC1, and proposes new variance estimators for the multiple-rater generalized π and AC1 statistics whose validity does not depend on the hypothesis of independence between raters.
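For reference, Gwet's AC1 for two raters and $K$ categories keeps the familiar chance-corrected form but replaces the chance-agreement term (a sketch of the published definition; the notation here is ours):

$$\mathrm{AC1} = \frac{p_a - p_e}{1 - p_e}, \qquad p_e = \frac{1}{K-1} \sum_{k=1}^{K} \pi_k (1 - \pi_k), \qquad \pi_k = \frac{p_{k\cdot} + p_{\cdot k}}{2},$$

where $p_a$ is the observed agreement and $\pi_k$ averages the two raters' marginal proportions for category $k$. This choice of $p_e$ is what makes AC1 more stable than kappa when agreement is high and marginals are skewed.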

Meta-analysis of Cohen’s kappa

- Psychology
- Health Services and Outcomes Research Methodology
- 2011

Cohen’s κ is the most important and most widely accepted measure of inter-rater reliability when the outcome of interest is measured on a nominal scale. The estimates of Cohen’s κ usually vary from…

Assessing the inter-rater agreement for ordinal data through weighted indexes

- Mathematics, Medicine
- Statistical methods in medical research
- 2016

A modification of Fleiss’ kappa, not affected by paradoxes, is proposed and subsequently generalized to the case of ordinal variables; this generalizes the use of s* to a bivariate case.

Equivalences of weighted kappas for multiple raters

- Mathematics
- 2012

Abstract Cohen’s unweighted kappa and weighted kappa are popular descriptive statistics for measuring agreement between two raters on a categorical scale. With m ≥ 3 raters, there are several views…

Bayesian inference for kappa from single and multiple studies.

- Mathematics, Medicine
- Biometrics
- 2000

A Bayesian analysis for kappa that can be routinely implemented using Markov chain Monte Carlo methodology is described, and extensive simulation is carried out to compare the performance of the Bayesian and frequentist tests.

A comparison of Cohen's kappa and agreement coefficients by Corrado Gini

- Mathematics
- 2013

The paper compares four coefficients that can be used to summarize inter-rater agreement on a nominal scale. The coefficients are Cohen's kappa and three coefficients that were originally proposed by…

Statistical description of interrater variability in ordinal ratings

- Computer Science, Medicine
- Statistical methods in medical research
- 2000

A new graphical approach to describing interrater variability is presented; it involves a simple frequency distribution display of the category probabilities and provides a simple visual summary of the rating data.

#### References

SHOWING 1-10 OF 78 REFERENCES

The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability

- Mathematics
- 1973

or weighted kappa (Spitzer, Cohen, Fleiss and Endicott, 1967; Cohen, 1968a). Kappa is the proportion of agreement corrected for chance, and scaled to vary from -1 to +1 so that a negative value…
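In Cohen's (1968) formulation with disagreement weights $w_{ij}$, weighted kappa can be written (notation ours) as

$$\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, p_{ij}}{\sum_{i,j} w_{ij}\, p_{i\cdot}\, p_{\cdot j}},$$

where $p_{ij}$ is the joint proportion of ratings in cell $(i, j)$ and $p_{i\cdot}$, $p_{\cdot j}$ are the marginals. With quadratic weights $w_{ij} = (i - j)^2$, this yields the asymptotic equivalence with the intraclass correlation coefficient that the paper establishes.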

Measuring interrater reliability among multiple raters: an example of methods for nominal data.

- Computer Science, Medicine
- Statistics in medicine
- 1990

Modifications of previously published estimators, appropriate for measuring reliability under stratified sampling frames, are introduced, and these measures are interpreted in view of standard errors computed using the jackknife.

Comparison of the Null Distributions of Weighted Kappa and the C Ordinal Statistic

- Mathematics
- 1977

It frequently occurs in psychological research that an investigator is interested in assessing the extent of interrater agreement when the data are measured on an ordinal scale. This Monte Carlo…

Bias, prevalence and kappa.

- Mathematics, Medicine
- Journal of clinical epidemiology
- 1993

New indices that provide independent measures of bias and prevalence, as well as of observed agreement, are defined, and a simple formula is derived that expresses kappa in terms of these three indices.

Another look at interrater agreement.

- Medicine
- Psychological bulletin
- 1988

Consideration of the properties of three chance-corrected measures of inter-rater agreement leads to the recommendation that a test of marginal homogeneity be conducted as a first step in the assessment of rater agreement.

Extension of the kappa coefficient.

- Mathematics, Medicine
- Biometrics
- 1980

An extension of the kappa coefficient is proposed that is appropriate for use with multiple observations per subject and multiple response choices per observation, and new approaches to difficult problems in the evaluation of reliability are illustrated.

Assessing interrater agreement from dependent data.

- Mathematics, Medicine
- Biometrics
- 1997

This work investigates the use of a latent model proposed by Qu, Piedmonte, and Medendorp (1995) to estimate the correlation between raters for each method, and tests for their equality.

Measurement of interrater agreement with adjustment for covariates.

- Mathematics, Medicine
- Biometrics
- 1996

The kappa coefficient measures chance-corrected agreement between two observers in the dichotomous classification of subjects and assumes both raters have the same marginal probability of classification, but this probability may depend on one or more covariates.

Modelling patterns of agreement and disagreement

- Mathematics, Medicine
- Statistical methods in medical research
- 1992

A survey of ways of statistically modelling patterns of observer agreement and disagreement is presented, with the main emphasis on modelling inter-observer agreement for categorical responses, both for nominal and ordinal response scales.

Large sample standard errors of kappa and weighted kappa.

- Mathematics
- 1969

The statistics kappa (Cohen, 1960) and weighted kappa (Cohen, 1968) were introduced to provide coefficients of agreement between two raters for nominal scales. Kappa is appropriate when all… Expand