Computing inter-rater reliability and its variance in the presence of high agreement.

@article{Gwet2008ComputingIR,
  title={Computing inter-rater reliability and its variance in the presence of high agreement.},
  author={Kilem L. Gwet},
  journal={The British journal of mathematical and statistical psychology},
  year={2008},
  volume={61 Pt 1},
  pages={29-48}
}
  • K. Gwet
  • Published 1 May 2008
  • Psychology
  • The British journal of mathematical and statistical psychology
Pi (π) and kappa (κ) statistics are widely used in the areas of psychiatry and psychological testing to compute the extent of agreement between raters on nominally scaled data. It is a fact that these coefficients occasionally yield unexpected results in situations known as the paradoxes of kappa. This paper explores the origin of these limitations, and introduces an alternative and more stable agreement coefficient referred to as the AC1 coefficient. Also proposed are new variance…
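
The paradox mentioned in the abstract is easy to reproduce numerically. The sketch below (illustrative Python, not code from the paper; the function and variable names are my own) compares Cohen's kappa, Scott's pi, and Gwet's AC1 on a 2×2 table of two raters' binary ratings. All three coefficients have the form (p_a − p_e)/(1 − p_e), where p_a is the observed agreement; they differ only in the chance-agreement term p_e: kappa uses the product of the two raters' marginals, pi the squared averaged marginals, and AC1 (the paper's general formula specialized to two categories) 2·p̄(1 − p̄), where p̄ is the averaged "yes" marginal.

    def agreement_coefficients(a, b, c, d):
        """Cohen's kappa, Scott's pi, and Gwet's AC1 from a 2x2 table of counts:
        a = both raters say "yes", d = both say "no", b and c = disagreements."""
        n = a + b + c + d
        p_a = (a + d) / n                         # observed agreement
        r1, r2 = (a + b) / n, (a + c) / n         # each rater's "yes" marginal
        # Cohen's kappa: chance agreement from the product of the two marginals
        pe_kappa = r1 * r2 + (1 - r1) * (1 - r2)
        # Scott's pi: chance agreement from the averaged marginal
        p_bar = (r1 + r2) / 2
        pe_pi = p_bar ** 2 + (1 - p_bar) ** 2
        # Gwet's AC1, binary case: 2*p*(1-p), which shrinks as the marginals skew
        pe_ac1 = 2 * p_bar * (1 - p_bar)
        corrected = lambda p_e: (p_a - p_e) / (1 - p_e)
        return corrected(pe_kappa), corrected(pe_pi), corrected(pe_ac1)

    # 90% raw agreement with heavily skewed marginals: kappa and pi go negative
    # while AC1 stays close to the observed agreement.
    kappa, pi, ac1 = agreement_coefficients(a=90, b=5, c=5, d=0)
    print(f"kappa={kappa:.3f}, pi={pi:.3f}, AC1={ac1:.3f}")
    # kappa=-0.053, pi=-0.053, AC1=0.890

The paper's general q-category chance term for AC1 is (1/(q−1))·Σ_k π_k(1−π_k); the binary expression above is that formula with q = 2, which is what makes AC1 stable when one category dominates.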

Citations

Statistical inference of agreement coefficient between two raters with binary outcomes
  • T. Ohyama
  • Psychology
    Communications in Statistics - Theory and Methods
  • 2019
Scott’s pi and Cohen’s kappa are widely used for assessing the degree of agreement between two raters with binary outcomes. However, many authors have pointed out their paradoxical behavior, …
A Study on Comparison of Generalized Kappa Statistics in Agreement Analysis
Agreement analysis is conducted to assess reliability among rating results performed repeatedly on the same subjects by one or more raters. The kappa statistic is commonly used when rating scales are …
Testing the Difference of Correlated Agreement Coefficients for Statistical Significance
  • K. Gwet
  • Mathematics
    Educational and psychological measurement
  • 2016
TLDR
A technique similar to the classical pairwise t test for means is proposed, based on a large-sample linear approximation of the agreement coefficient; it requires neither advanced statistical modeling skills nor considerable computer programming experience.
A new coefficient of interrater agreement: The challenge of highly unequal category proportions.
We derive a general structure that encompasses important coefficients of interrater agreement such as the S-coefficient, Cohen's kappa, Scott's pi, Fleiss' kappa, Krippendorff's alpha, and Gwet's …
Variance Estimation of Nominal-Scale Inter-Rater Reliability with Random Selection of Raters
Most inter-rater reliability studies using nominal scales suggest the existence of two populations of inference: the population of subjects (collection of objects or persons to be rated) and that of …
Statistical inference of Gwet’s AC1 coefficient for multiple raters and binary outcomes
Cohen’s kappa and intraclass kappa are widely used for assessing the degree of agreement between two raters with binary outcomes. However, many authors have pointed out their paradoxical …
Fleiss’ kappa statistic without paradoxes
The Fleiss’ kappa statistic is a well-known index for assessing the reliability of agreement between raters. It is used in both the psychological and the psychiatric fields. Unfortunately, the …
How Robust Are Multirater Interrater Reliability Indices to Changes in Frequency Distribution?
Interrater reliability studies are used in a diverse set of fields. Often, these investigations involve three or more raters, and thus require the use of indices such as Fleiss’s kappa, …
Implementing a General Framework for Assessing Interrater Agreement in Stata
  • Daniel Klein
  • Computer Science
    The Stata Journal: Promoting communications on statistics and Stata
  • 2018
TLDR
Gwet’s (2014, Handbook of Inter-Rater Reliability) recently developed framework of interrater agreement coefficients is reviewed, and the kappaetc command, which implements this framework in Stata, is introduced.
Large-Sample Variance of Fleiss Generalized Kappa
  • K. Gwet
  • Mathematics
    Educational and psychological measurement
  • 2021
TLDR
The purpose of this article is to show that the large-sample variance of Fleiss’ generalized kappa is systematically being misused, is invalid as a precision measure for kappa, and cannot be used for constructing confidence intervals.

References

Showing 1–10 of 24 references
Beyond kappa: A review of interrater agreement measures
In 1960, Cohen introduced the kappa coefficient to measure chance-corrected nominal scale agreement between two raters. Since then, numerous extensions and generalizations of this interrater …
An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers.
TLDR
A subset of observers who demonstrate a high level of interobserver agreement can be identified by using pairwise agreement statistics between each observer and the internal majority standard opinion on each subject.
Large sample standard errors of kappa and weighted kappa.
The statistics kappa (Cohen, 1960) and weighted kappa (Cohen, 1968) were introduced to provide coefficients of agreement between two raters for nominal scales. Kappa is appropriate when all …
A Coefficient of Agreement for Nominal Scales
CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of …
High agreement but low kappa: II. Resolving the paradoxes.
Integration and generalization of kappas for multiple raters.
J. A. Cohen's kappa (1960) for measuring agreement between 2 raters, using a nominal scale, has been extended for use with multiple raters by R. J. Light (1971) and J. L. Fleiss (1971). In the …
Ramifications of a population model for κ as a coefficient of reliability
Coefficient κ is generally defined in terms of procedures of computation rather than in terms of a population. Here a population definition is proposed. On this basis, the interpretation of κ as a …
Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit.
  • J. Cohen
  • Psychology
    Psychological bulletin
  • 1968
TLDR
The Kw provides for the incorporation of ratio-scaled degrees of disagreement (or agreement) to each of the cells of the k × k table of joint …
Categorical data analysis (2nd ed.)
  • 2002
Measuring nominal scale agreement among many raters.