A three-stage approach to the automated scoring of spontaneous spoken responses

Authors:
Derrick Higgins;Xiaoming Xi;Klaus Zechner;David Williamson
Affiliations:
Educational Testing Service, Rosedale Road, Princeton, NJ 08541, USA;Educational Testing Service, Rosedale Road, Princeton, NJ 08541, USA;Educational Testing Service, Rosedale Road, Princeton, NJ 08541, USA;Educational Testing Service, Rosedale Road, Princeton, NJ 08541, USA
Venue:
Computer Speech and Language
Year:
2011

Abstract

This paper presents a description and evaluation of SpeechRater℠, a system for automated scoring of non-native speakers' spoken English proficiency, based on tasks which elicit spontaneous monologues on particular topics. This system builds on much previous work in the automated scoring of test responses, but differs from previous work in that the highly unpredictable nature of the responses to this task type makes the challenge of accurate scoring much more difficult. SpeechRater uses a three-stage architecture. Responses are first processed by a filtering model to ensure that no exceptional conditions exist which might prevent them from being scored by SpeechRater. Responses not filtered out at this stage are then processed by the scoring model to estimate the proficiency rating which a human might assign to them, on the basis of features related to fluency, pronunciation, vocabulary diversity, and grammar. Finally, an aggregation model combines an examinee's scores for multiple items to calculate a total score, as well as an interval in which the examinee's score is predicted to reside with high confidence. SpeechRater's current level of accuracy and construct representation have been deemed sufficient for low-stakes practice exercises, and it has been used in a practice exam for the TOEFL since late 2006. In such a practice environment, it offers a number of advantages compared to human raters, including system load management, and the facilitation of immediate feedback to students. However, it must be acknowledged that SpeechRater presently fails to measure many important aspects of speaking proficiency (such as intonation and appropriateness of topic development), and its agreement with human ratings of proficiency does not yet approach the level of agreement between two human raters.
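
The three-stage flow described in the abstract (filter, score, aggregate) can be illustrated with a minimal Python sketch. This is not SpeechRater's implementation: the abstract does not specify the models, so every name here (Response, filter_response, score_response, aggregate, score_examinee), the unweighted mean used as a scoring model, and the fixed interval half-width are hypothetical placeholders chosen only to show how the stages connect.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Sequence, Tuple


@dataclass
class Response:
    # Hypothetical feature values; the abstract names only the feature families.
    fluency: float
    pronunciation: float
    vocabulary_diversity: float
    grammar: float
    # Stand-in for the filtering model's decision (assumption, not from the paper).
    scorable: bool = True


def filter_response(response: Response) -> bool:
    """Stage 1: return True if no exceptional condition blocks scoring."""
    return response.scorable


def score_response(response: Response) -> float:
    """Stage 2: placeholder scoring model, an unweighted mean of the features."""
    return mean([
        response.fluency,
        response.pronunciation,
        response.vocabulary_diversity,
        response.grammar,
    ])


def aggregate(
    scores: Sequence[float], half_width: float = 0.5
) -> Tuple[float, Tuple[float, float]]:
    """Stage 3: total score plus an illustrative fixed-width interval around it."""
    total = sum(scores)
    return total, (total - half_width, total + half_width)


def score_examinee(
    responses: List[Response],
) -> Optional[Tuple[float, Tuple[float, float]]]:
    """Run all three stages; return None when every response is filtered out."""
    scores = [score_response(r) for r in responses if filter_response(r)]
    if not scores:
        return None
    return aggregate(scores)
```

Calling score_examinee on a list of Response objects returns a total and an interval, or None if no response passes the filter. A real system would replace each placeholder with a trained model and would derive the interval from the model's prediction error rather than a constant.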