The e-rater system™ is an operational automated essay scoring system developed at Educational Testing Service (ETS). The average agreement between two human readers, and between an independent human reader and e-rater, is approximately 92%. There is considerable interest in the larger writing community in examining the system's performance on nonnative speaker essays. This paper reports the results of a study of e-rater's performance on Test of Written English (TWE) essay responses written by nonnative English speakers whose native language is Chinese, Arabic, or Spanish. In addition, one small sample of the data is from US-born English speakers, and another is from non-US-born candidates who report that their native language is English. As expected, significant differences were found between the scores of the English-speaking groups and those of the nonnative speakers. While there were also differences between e-rater and the human readers across the various language groups, the average agreement rate was as high as the operational agreement rate. At least four of the five features included in e-rater's current operational models (including discourse, topical, and syntactic features) also appear in the TWE models. This suggests that the features generalize well over a wide range of linguistic variation: e-rater was not confounded by non-standard English syntactic structures or stylistic discourse structures that one might expect to be a problem for a system designed to evaluate native-speaker writing.
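The ~92% agreement figure is easiest to read as an agreement rate between pairs of raters over a set of essays. The sketch below (Python) shows how such a rate can be computed; the 1-6 score scale, the one-point "adjacent agreement" tolerance, and the example scores are assumptions for illustration only, not details taken from the paper.

```python
# Minimal sketch of a rater-agreement statistic like the one cited in the abstract.
# Assumptions (not from the source): essays are scored on a 1-6 scale, and
# "agreement" counts scores that match exactly or differ by at most one point,
# a common convention in essay scoring.

def agreement_rate(scores_a, scores_b, tolerance=1):
    """Fraction of essays on which two raters agree within `tolerance` points."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Score lists must be the same length")
    hits = sum(1 for a, b in zip(scores_a, scores_b) if abs(a - b) <= tolerance)
    return hits / len(scores_a)

# Hypothetical example: one human reader's scores vs. e-rater's scores.
human   = [4, 5, 3, 6, 2]
e_rater = [4, 4, 3, 5, 4]

print(f"exact agreement:            {agreement_rate(human, e_rater, tolerance=0):.2f}")
print(f"exact + adjacent agreement: {agreement_rate(human, e_rater, tolerance=1):.2f}")
```

Under these assumptions, the reported figure would correspond to the exact-plus-adjacent rate averaged over many essays and rater pairs; the exact definition used operationally is not specified in the abstract.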