Automated assessment of short free-text responses in computer science using latent semantic analysis

  • Authors:
  • Richard Klein; Angelo Kyrilov; Mayya Tokman

  • Affiliations:
  • University of the Witwatersrand, Johannesburg, South Africa; University of the Witwatersrand, Johannesburg, South Africa; University of California Merced, Merced, CA, USA

  • Venue:
  • Proceedings of the 16th annual joint conference on Innovation and technology in computer science education
  • Year:
  • 2011

Abstract

In the last few decades, much research has focused on the evaluation and assessment of students' knowledge. The idea that computers can now be used to aid assessment is appealing. While implementing automatic marking of multiple-choice questions is trivial, most educators agree that this form of assessment provides only limited insight into students' knowledge. Due to this limitation, teachers prefer to use unstructured questions in assessments. These questions, however, are much more difficult to mark, because the semantic meaning of a response matters more than any individual keyword. Latent Semantic Analysis (LSA) is well suited to inferring meaning from text in this way. This paper describes the design, implementation, and evaluation of an automatic marking system based on LSA and designed to grade paragraph responses to exam questions. In addition to presenting the algorithm and the theoretical basis of the system, we describe the experiments conducted to evaluate its efficacy. These included comparing the marks generated by the system for exams from several computer science courses with the original grades awarded by a human examiner. The various settings of the system are studied to understand their effect on accuracy. Using this understanding, along with trial and error, good configurations for each question are found. Under ideal configurations the system is capable of generating marks whose correlation with the human examiner's grades exceeds 0.80, which is considered acceptable. Generating these ideal configurations is nontrivial but possible by designing effective ways to extract appropriate settings from features of the data. Given enough training data, the system performs at rates that match the inter-rater correlation between human markers.
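The core LSA pipeline the abstract alludes to can be sketched roughly as follows: build a term-document matrix from previously marked answers, truncate its SVD to a low-rank semantic space, fold new responses into that space, and score a student response by cosine similarity to a model answer. This is a minimal illustration, not the authors' implementation; the tiny corpus, the rank `k=2`, and all function names are assumptions for the sake of the example.

```python
import numpy as np

def term_doc_matrix(docs):
    """Raw term-frequency matrix (terms x documents) over a small corpus."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    return A, index

def lsa_space(A, k):
    """Truncated SVD: keep the top-k left singular vectors and values."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k]

def project(doc, index, Uk, sk):
    """Fold a new document into the k-dimensional semantic space."""
    v = np.zeros(Uk.shape[0])
    for w in doc.lower().split():
        if w in index:          # out-of-vocabulary words are ignored
            v[index[w]] += 1.0
    return (Uk.T @ v) / sk

def similarity(a, b):
    """Cosine similarity between two vectors in the LSA space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Illustrative "training" corpus of previously marked answers.
corpus = [
    "a stack is a last in first out data structure",
    "a queue is a first in first out data structure",
    "recursion is when a function calls itself",
]
A, index = term_doc_matrix(corpus)
Uk, sk = lsa_space(A, k=2)       # k is one of the settings to be tuned

model = project("a stack is a last in first out structure", index, Uk, sk)
student = project("a stack stores items last in first out", index, Uk, sk)
print(similarity(model, student))
```

In a real system of the kind the paper describes, the similarity score would be mapped onto a mark scale calibrated against human-graded training responses, and settings such as the rank `k` would be tuned per question.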