Human fallibility: how well do human markers agree?

  • Authors: Debra Haley, Pete Thomas, Marian Petre, Anne de Roeck

  • Affiliations: The Open University, Milton Keynes, UK (all authors)

  • Venue: ACE '09 Proceedings of the Eleventh Australasian Conference on Computing Education - Volume 95
  • Year: 2009

Abstract

Marker bias and inconsistency are widely seen as problems in the field of assessment. Various institutions have put in place a practice of second and even third marking to promote fairness. However, we were able to find very little evidence, beyond anecdotal reports, of human fallibility to justify the effort and expense of second marking. This paper fills that gap by providing the results of a large-scale study that compared five human markers marking 18 different questions in the field of Computer Science, each with 50 student answers. The study found that the human inter-rater reliability (IRR) varied widely both within a particular question and across the 18 questions. This paper uses the Gwet AC1 statistic to measure the inter-rater reliability of the five markers. The study was motivated by the desire to assess the accuracy of a computer assisted assessment (CAA) system we are developing. We claim that a CAA system does not need to be more accurate than human markers; thus, we needed to quantify how accurate human markers are.
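
The abstract names Gwet's AC1 as the agreement statistic. As a rough illustration of how that coefficient is computed for multiple raters and categorical marks, the sketch below implements the standard AC1 formula in Python; the function name, the toy marks, and the 0-2 scoring scale are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def gwet_ac1(ratings, categories=None):
    """Gwet's AC1 agreement coefficient for multiple raters.

    ratings: 2-D array-like of shape (n_subjects, n_raters) holding
             categorical marks (e.g. the score awarded to each answer).
    categories: optional list of possible marks; defaults to the distinct
                values observed in `ratings`. Assumes at least two categories.
    """
    ratings = np.asarray(ratings)
    if categories is None:
        categories = np.unique(ratings)
    q = len(categories)

    # counts[i, k]: how many raters placed subject i in category k
    counts = np.stack([(ratings == c).sum(axis=1) for c in categories], axis=1)
    r_i = counts.sum(axis=1)  # raters per subject

    # Observed agreement: average pairwise agreement per subject
    p_a = np.mean((counts * (counts - 1)).sum(axis=1) / (r_i * (r_i - 1)))

    # Chance agreement from the overall category prevalences pi_k
    pi_k = (counts / r_i[:, None]).mean(axis=0)
    p_e = (pi_k * (1 - pi_k)).sum() / (q - 1)

    return (p_a - p_e) / (1 - p_e)

# Toy example (illustrative data only): 4 answers, each marked 0-2 by 5 markers
marks = [[2, 2, 2, 1, 2],
         [0, 0, 1, 0, 0],
         [1, 1, 1, 1, 1],
         [2, 1, 2, 2, 2]]
print(round(gwet_ac1(marks), 3))  # about 0.559
```

AC1 is often preferred over kappa-style coefficients when category prevalences are skewed, because its chance-correction term is less sensitive to prevalence; that property is a common reason for choosing it in marking studies of this kind.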