Agreement measures are used frequently in reliability studies that involve categorical data. Simple measures such as observed agreement and specific agreement can reveal a good deal about the sample. Chance-corrected agreement in the form of the kappa statistic is widely used because it corresponds to an intraclass correlation coefficient and is easy to calculate, but its magnitude depends on the tasks and categories in the experiment. When the goal is to improve the reliability of an instrument or of the raters, it is helpful to separate the components of disagreement. Approaches that model the decision-making process can help here, including tetrachoric correlation, polychoric correlation, latent trait models, and latent class models. Decision-making models can also be used to better understand the behavior of different agreement metrics. For example, if the observed prevalence of responses in one of two available categories is low, the sample contains too little information to judge the raters' ability to discriminate cases; kappa may then underestimate the true agreement, while observed agreement may overestimate it.
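To make these quantities concrete, the following is a minimal sketch in plain Python for the two-rater, binary-category case. Cohen's kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from the raters' marginal prevalences. The function and variable names, and the sample data, are illustrative only and not taken from the paper.

def agreement_measures(rater_a, rater_b):
    """Observed agreement, specific agreement, and Cohen's kappa for
    two raters labeling the same cases as 1 (positive) or 0 (negative)."""
    n = len(rater_a)
    # Cells of the 2x2 contingency table.
    both_pos = sum(a == 1 and b == 1 for a, b in zip(rater_a, rater_b))
    both_neg = sum(a == 0 and b == 0 for a, b in zip(rater_a, rater_b))
    only_a = sum(a == 1 and b == 0 for a, b in zip(rater_a, rater_b))
    only_b = sum(a == 0 and b == 1 for a, b in zip(rater_a, rater_b))

    p_o = (both_pos + both_neg) / n  # observed agreement
    # Specific agreement, conditional on at least one rater using the category.
    pos_agree = 2 * both_pos / (2 * both_pos + only_a + only_b)
    neg_agree = 2 * both_neg / (2 * both_neg + only_a + only_b)
    # Chance agreement from each rater's marginal prevalence of positives.
    prev_a = (both_pos + only_a) / n
    prev_b = (both_pos + only_b) / n
    p_e = prev_a * prev_b + (1 - prev_a) * (1 - prev_b)
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, pos_agree, neg_agree, kappa

# Hypothetical low-prevalence sample: 100 cases, rater A marks 2 positive,
# rater B marks 1 positive, and they agree on one of them.
a = [1, 1] + [0] * 98
b = [1, 0] + [0] * 98
p_o, pos_agree, neg_agree, kappa = agreement_measures(a, b)
print(f"observed={p_o:.2f} pos={pos_agree:.2f} neg={neg_agree:.2f} kappa={kappa:.2f}")
# observed=0.99 pos=0.67 neg=0.99 kappa=0.66

In this constructed example, observed agreement (0.99) looks excellent while kappa (0.66) and positive specific agreement (0.67) are much lower: with so few positive cases, the sample carries little information about whether the raters can discriminate positives, and the summary statistics diverge in just the way the abstract describes.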