
Variance Estimation of Nominal-Scale Inter-Rater Reliability with Random Selection of Raters

Theory and Methods · Psychometrika

Abstract

Most inter-rater reliability studies using nominal scales involve two populations of inference: the population of subjects (the objects or persons to be rated) and the population of raters. The sampling variance of the inter-rater reliability coefficient therefore reflects the combined effect of sampling subjects and sampling raters. However, all inter-rater reliability variance estimators proposed in the literature account only for subject sampling variability, ignoring the extra sampling variance due to the selection of raters, even though the latter may be the largest of the variance components. Such variance estimators permit statistical inference only to the universe of subjects. This paper proposes variance estimators that permit inference to both universes of subjects and raters. These estimators are proved consistent and valid for confidence interval construction. The results apply only to fully crossed designs, in which every rater rates every subject. A small Monte Carlo simulation study demonstrates the accuracy of the large-sample approximations in reasonably small samples.
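
To make the two variance components concrete, here is a minimal numerical sketch. It is not the paper's method: the coefficient below is raw pairwise agreement rather than the chance-corrected statistic the article studies, and a delete-one jackknife stands in for the closed-form variance estimators derived in the paper. The helper names (pairwise_agreement, jackknife_var), the simulated data, and the additive combination of the two components are illustrative assumptions. Deleting subjects one at a time approximates the subject-sampling component; deleting raters approximates the rater-sampling component.

```python
import numpy as np

def pairwise_agreement(ratings):
    """Observed pairwise agreement for an n_subjects x n_raters matrix
    of nominal codes, assuming a fully crossed design (every rater
    rates every subject)."""
    n, r = ratings.shape
    per_subject = np.empty(n)
    for i in range(n):
        # Fraction of rater pairs assigning the same category to subject i.
        _, counts = np.unique(ratings[i], return_counts=True)
        per_subject[i] = (counts * (counts - 1)).sum() / (r * (r - 1))
    return per_subject.mean()

def jackknife_var(ratings, axis):
    """Delete-one jackknife variance of the agreement coefficient,
    deleting along `axis` (0 = subjects, 1 = raters)."""
    m = ratings.shape[axis]
    stats = np.array([pairwise_agreement(np.delete(ratings, k, axis=axis))
                      for k in range(m)])
    return (m - 1) / m * ((stats - stats.mean()) ** 2).sum()

rng = np.random.default_rng(0)
ratings = rng.integers(0, 3, size=(50, 5))     # 50 subjects, 5 raters, 3 categories

p_hat = pairwise_agreement(ratings)
var_subjects = jackknife_var(ratings, axis=0)  # subject-sampling component
var_raters = jackknife_var(ratings, axis=1)    # rater-sampling component, the one
                                               # ignored by standard estimators
print(f"agreement = {p_hat:.3f}, "
      f"combined variance (rough proxy) = {var_subjects + var_raters:.5f}")
```

With only five raters, the rater-deletion component is typically non-negligible relative to the subject component, which is the abstract's point: reporting the subject component alone understates the uncertainty whenever inference is meant to extend to the rater universe as well. The exact magnitudes depend on the simulated data and seed.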



Author information

Correspondence to Kilem Li Gwet.

About this article

Cite this article

Gwet, K.L. Variance Estimation of Nominal-Scale Inter-Rater Reliability with Random Selection of Raters. Psychometrika 73, 407–430 (2008). https://doi.org/10.1007/s11336-007-9054-8
