Beyond kappa: A review of interrater agreement measures

  • Mousumi Banerjee, Michelle Hopkins Capozzoli, Laura McSweeney, Debajyoti Sinha
  • Canadian Journal of Statistics
In 1960, Cohen introduced the kappa coefficient to measure chance‐corrected nominal scale agreement between two raters. Since then, numerous extensions and generalizations of this interrater agreement measure have been proposed in the literature. This paper reviews and critiques various approaches to the study of interrater agreement, for which the relevant data comprise either nominal or ordinal categorical ratings from multiple raters. It presents a comprehensive compilation of the main… 
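As a rough illustration of what "chance-corrected" means here, the sketch below computes Cohen's kappa for two raters as (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e is the agreement expected if the raters were independent. The function name `cohen_kappa` and the ratings are hypothetical, chosen only for this example; they are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters on a nominal scale."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Observed proportion of subjects on which the two raters agree.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance under independence, computed from
    # each rater's own marginal category counts.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined: both raters used one category")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical ratings, for illustration only.
rater_a = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "no"]
rater_b = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes", "no"]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

In this made-up example the raters agree on 8 of 10 subjects (p_o = 0.8), and each uses "yes" half the time (p_e = 0.5), so kappa comes out at about 0.6.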

Assessing agreement between raters from the point of coefficients and loglinear models

Abstract: In square contingency tables, analysis of agreement between row and column classifications is of interest. For nominal categories, the kappa coefficient is used to summarize the degree of

Some Statistical Aspects of Measuring Agreement Based on a Modified Kappa

The focus of this paper is the statistical inference of the problem of assessing agreement or disagreement between two raters who employ measurements on a two-level nominal scale. The purpose of this

On the Equivalence of Multirater Kappas Based on 2-Agreement and 3-Agreement with Binary Scores

Cohen’s kappa is a popular descriptive statistic for summarizing agreement between the classifications of two raters on a nominal scale. With more than two raters there are several views in the literature on how

Meta-analysis of Cohen’s kappa

  • Shuyan Sun
  • Psychology
    Health Services and Outcomes Research Methodology
  • 2011
Cohen’s κ is the most important and most widely accepted measure of inter-rater reliability when the outcome of interest is measured on a nominal scale. The estimates of Cohen’s κ usually vary from

Computing inter-rater reliability and its variance in the presence of high agreement.

  • K. Gwet
  • Psychology
    The British journal of mathematical and statistical psychology
  • 2008
This paper explores the origin of these limitations, and introduces an alternative and more stable agreement coefficient referred to as the AC1 coefficient, and proposes new variance estimators for the multiple-rater generalized pi and AC1 statistics, whose validity does not depend upon the hypothesis of independence between raters.
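For readers unfamiliar with AC1, the following is a minimal sketch of its commonly cited two-rater form, assuming the chance-agreement term p_e = (1 / (q - 1)) Σ π_k (1 - π_k), where q is the number of categories and π_k is the average of the two raters' marginal proportions for category k. The function name `gwet_ac1` and the data are hypothetical. The multiple-rater version and the variance estimators mentioned in the summary are not covered.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories):
    """Two-rater AC1 sketch; `categories` lists every possible category."""
    n, q = len(ratings_a), len(categories)
    if n == 0 or n != len(ratings_b) or q < 2:
        raise ValueError("need equal-length ratings and at least 2 categories")
    # Observed proportion of agreement.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # pi_k: average of the two raters' marginal proportions for category k.
    pi = [(count_a[c] + count_b[c]) / (2 * n) for c in categories]
    # Chance agreement under the assumed AC1 definition (always below 1).
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical high-agreement data: 90 subjects rated "pos" by both raters,
# 10 rated differently, none rated "neg" by both.
rater_a = ["pos"] * 90 + ["pos"] * 5 + ["neg"] * 5
rater_b = ["pos"] * 90 + ["neg"] * 5 + ["pos"] * 5
ac1 = gwet_ac1(rater_a, rater_b, ["pos", "neg"])
print(round(ac1, 3))
```

With these made-up ratings the observed agreement is 0.90 and AC1 is roughly 0.89, whereas Cohen's kappa for the same table is slightly negative (about -0.05), which is the kind of high-agreement behaviour the paper addresses.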

Equivalences of weighted kappas for multiple raters

Multi-rater delta: extending the delta nominal measure of agreement between two raters to many raters

The coefficient delta is extended from R = 2 raters to R ≥ 2 raters (the multi-rater delta coefficient), and it is shown that it can be expressed in the kappa format and retains the advantages of the coefficient delta over the classic kappa-type coefficients.

Bayesian Inference for Kappa from Single and Multiple Studies

Bayesian analysis for kappa that can be routinely implemented using Markov chain Monte Carlo methodology is described and extensive simulation is carried out to compare the performances of the Bayesian and the frequentist tests.

Statistical description of interrater variability in ordinal ratings

A new graphical approach to describing interrater variability that involves a simple frequency distribution display of the category probabilities and provides a simple visual summary of the rating data is presented.


The paper compares four coefficients that can be used to summarize inter-rater agreement on a nominal scale. The coefficients are Cohen's kappa and three coefficients that were originally proposed by



The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability

or weighted kappa (Spitzer, Cohen, Fleiss and Endicott, 1967; Cohen, 1968a). Kappa is the proportion of agreement corrected for chance, and scaled to vary from -1 to +1 so that a negative value

Comparison of the Null Distributions of Weighted Kappa and the C Ordinal Statistic

It frequently occurs in psychological research that an investigator is interested in assessing the extent of interrater agreement when the data are measured on an ordinal scale. This Monte Carlo

Measuring interrater reliability among multiple raters: an example of methods for nominal data.

Modifications of previously published estimators, appropriate for measuring reliability in the case of stratified sampling frames, are introduced, and these measures are interpreted in view of standard errors computed using the jackknife.

Bias, prevalence and kappa.

Extension of the kappa coefficient.

An extension of the kappa coefficient is proposed which is appropriate for use with multiple observations per subject and for multiple response choices per observation. The extension is used to illustrate new approaches to difficult problems in the evaluation of reliability.

Assessing interrater agreement from dependent data.

This work investigates the use of a latent model proposed by Qu, Piedmonte, and Medendorp (1995) to estimate the correlation between raters for each method, and test for their equality.

Modelling patterns of agreement and disagreement

  • A. Agresti
  • Psychology
    Statistical methods in medical research
  • 1992
A survey of ways of statistically modelling patterns of observer agreement and disagreement is presented, with main emphasis on modelling inter-observer agreement for categorical responses, both for nominal and ordinal response scales.

Measurement of interrater agreement with adjustment for covariates.

The kappa coefficient measures chance-corrected agreement between two observers in the dichotomous classification of subjects and assumes both raters have the same marginal probability of classification, but this probability may depend on one or more covariates.

Large sample standard errors of kappa and weighted kappa.

The statistics kappa (Cohen, 1960) and weighted kappa (Cohen, 1968) were introduced to provide coefficients of agreement between two raters for nominal scales. Kappa is appropriate when all
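To make the difference between kappa and weighted kappa concrete, here is a small sketch of weighted kappa computed from a square table of counts, written as 1 minus the ratio of observed to chance-expected weighted disagreement. The linear and quadratic disagreement weights, |i - j| and (i - j)², are common choices assumed here rather than taken from the snippet above. The function name `weighted_kappa` and the table are hypothetical, and the large-sample standard errors the paper concerns are not computed.

```python
def weighted_kappa(table, power=2):
    """Weighted kappa for a k x k table of counts with ordered categories.

    table[i][j] is the number of subjects placed in category i by the
    first rater and category j by the second. Disagreement weights are
    |i - j| ** power, with power=1 (linear) or power=2 (quadratic).
    """
    if power not in (1, 2):
        raise ValueError("power must be 1 (linear) or 2 (quadratic)")
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = 0.0  # weighted disagreement actually observed
    expected = 0.0  # weighted disagreement expected under independence
    for i in range(k):
        for j in range(k):
            w = abs(i - j) ** power
            observed += w * table[i][j] / n
            expected += w * row_tot[i] * col_tot[j] / (n * n)
    if expected == 0:
        raise ValueError("weighted kappa is undefined for this table")
    return 1.0 - observed / expected


# Hypothetical 3 x 3 table of counts, for illustration only.
table = [
    [20, 5, 0],
    [5, 20, 5],
    [0, 5, 20],
]
kw_quadratic = weighted_kappa(table, power=2)
kw_linear = weighted_kappa(table, power=1)
print(round(kw_quadratic, 3), round(kw_linear, 3))
```

For this made-up table all disagreements are between adjacent categories, so the quadratic-weight value (about 0.80) is higher than the linear-weight value (about 0.71).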

The measurement of observer agreement for categorical data.

A general statistical methodology for the analysis of multivariate categorical data arising from observer reliability studies is presented. Tests for interobserver bias are presented in terms of first-order marginal homogeneity, and measures of interobserver agreement are developed as generalized kappa-type statistics.