Benchmarking Kappa: Interrater Agreement in Software Process Assessments

  • Authors:
  • Khaled El Emam

  • Affiliations:
  • Fraunhofer Institute for Experimental Software Engineering, Sauerwiesen 6, D-67661 Kaiserslautern, Germany

  • Venue:
  • Empirical Software Engineering
  • Year:
  • 1999

Abstract

Software process assessments are by now a prevalent tool for process improvement and contract risk assessment in the software industry. Given that scores are assigned to processes during an assessment, a process assessment can be considered a subjective measurement procedure. As with any subjective measurement procedure, the reliability of process assessments has important implications for the utility of assessment scores, and therefore the reliability of assessments can be taken as a criterion for evaluating an assessment's quality. The particular type of reliability of interest in this paper is interrater agreement. Thus far, empirical evaluations of the interrater agreement of assessments have used Cohen's Kappa coefficient. Once a Kappa value has been derived, the next question is "how good is it?" Benchmarks for interpreting the obtained values of Kappa are available from the social sciences and medical literature. However, the applicability of these benchmarks to the software process assessment context is not obvious. In this paper we develop a benchmark for interpreting Kappa values using data from ratings of 70 process instances collected from assessments of 19 different projects in 7 different organizations in Europe during the SPICE Trials (an international effort to empirically evaluate the emerging ISO/IEC 15504 International Standard for Software Process Assessment). The benchmark indicates that Kappa values below 0.45 are poor, and values above 0.62 constitute substantial agreement and should be the minimum aimed for. This benchmark can be used to decide how good an assessment's reliability is.
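
For context, Cohen's Kappa corrects the observed proportion of agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The minimal Python sketch below shows the standard two-rater computation and annotates the result with the thresholds proposed in this paper; the rating data and the `cohens_kappa` helper are illustrative assumptions, not the paper's own implementation or the SPICE Trials data.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa for two raters scoring the same items (generic sketch;
    the exact computation used in the SPICE Trials analysis may differ)."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    # Observed agreement: proportion of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: computed from each rater's marginal rating frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical N/P/L/F-style process attribute ratings from two assessors.
ratings_a = ["F", "L", "P", "F", "N", "L", "F", "P"]
ratings_b = ["F", "L", "L", "F", "N", "L", "F", "P"]

kappa = cohens_kappa(ratings_a, ratings_b)
# Benchmark proposed in the paper:
#   kappa < 0.45  -> poor agreement
#   kappa > 0.62  -> substantial agreement (suggested minimum to aim for)
print(f"kappa = {kappa:.2f}")
```

In this toy example the two assessors disagree on one of eight ratings, giving a Kappa of roughly 0.83, which would sit comfortably above the 0.62 "substantial agreement" threshold proposed here.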