Human fallibility: how well do human markers agree?

  • Authors: Debra Haley, Pete Thomas, Marian Petre, Anne de Roeck

  • Affiliations: The Open University, Milton Keynes, UK (all authors)

  • Venue: ACE '09 Proceedings of the Eleventh Australasian Conference on Computing Education - Volume 95
  • Year: 2009

Abstract

Marker bias and inconsistency are widely seen as problems in the field of assessment. Various institutions have put in place a practice of second and even third marking to promote fairness. However, we were able to find very little evidence, beyond anecdotal reports, of human fallibility to justify the effort and expense of second marking. This paper fills that gap by providing the results of a large-scale study that compared five human markers marking 18 different questions in the field of Computer Science, each with 50 student answers. The study found that the human inter-rater reliability (IRR) varied widely both within a particular question and across the 18 questions. This paper uses the Gwet AC1 statistic to measure the inter-rater reliability of the five markers. The study was motivated by the desire to assess the accuracy of a computer assisted assessment (CAA) system we are developing. We claim that a CAA system does not need to be more accurate than human markers; thus, we needed to quantify how accurate human markers are.
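
The abstract names Gwet's AC1 as the agreement statistic. As a rough illustration of how that coefficient is computed for multiple raters and categorical marks, the sketch below implements the standard AC1 formula in Python; the function name, the toy marks, and the 0-2 scoring scale are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def gwet_ac1(ratings, categories=None):
    """Gwet's AC1 agreement coefficient for multiple raters.

    ratings: 2-D array-like of shape (n_subjects, n_raters) holding
             categorical marks (e.g. the score awarded to each answer).
    categories: optional list of possible marks; defaults to the distinct
                values observed in `ratings`. Assumes at least two categories.
    """
    ratings = np.asarray(ratings)
    if categories is None:
        categories = np.unique(ratings)
    q = len(categories)

    # counts[i, k]: how many raters placed subject i in category k
    counts = np.stack([(ratings == c).sum(axis=1) for c in categories], axis=1)
    r_i = counts.sum(axis=1)  # raters per subject

    # Observed agreement: average pairwise agreement per subject
    p_a = np.mean((counts * (counts - 1)).sum(axis=1) / (r_i * (r_i - 1)))

    # Chance agreement from the overall category prevalences pi_k
    pi_k = (counts / r_i[:, None]).mean(axis=0)
    p_e = (pi_k * (1 - pi_k)).sum() / (q - 1)

    return (p_a - p_e) / (1 - p_e)

# Toy example (illustrative data only): 4 answers, each marked 0-2 by 5 markers
marks = [[2, 2, 2, 1, 2],
         [0, 0, 1, 0, 0],
         [1, 1, 1, 1, 1],
         [2, 1, 2, 2, 2]]
print(round(gwet_ac1(marks), 3))  # about 0.559
```

AC1 is often preferred over kappa-style coefficients when category prevalences are skewed, because its chance-correction term is less sensitive to prevalence; that property is a common reason for choosing it in marking studies of this kind.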