Automatic scoring of non-native spontaneous speech in tests of spoken English

  • Authors:
  • Klaus Zechner; Derrick Higgins; Xiaoming Xi; David M. Williamson

  • Affiliations:
  • Educational Testing Service, Automated Scoring and NLP, Rosedale Road, MS 11-R, Princeton, NJ 08541, USA

  • Venue:
  • Speech Communication
  • Year:
  • 2009

Abstract

This paper presents the first version of the SpeechRater℠ system for automatically scoring non-native spontaneous high-entropy speech in the context of an online practice test for prospective takers of the Test of English as a Foreign Language® internet-based test (TOEFL® iBT). The system consists of three components: a speech recognizer trained on non-native English speech data; a feature computation module that uses the recognizer output to compute a set of mostly fluency-based features; and a multiple regression scoring model that predicts a speaking proficiency score for each test item response from a subset of the features generated by the previous component. Experiments with classification and regression trees (CART) complement those performed with multiple regression. We evaluate the system both on TOEFL Practice Online (TPO) data and on Field Study data collected before the introduction of the TOEFL iBT. Features are selected by test development experts based both on their empirical correlations with human scores and on their coverage of the concept of communicative competence. While the Pearson correlation between machine and human scores on complete sets of six TPO items (r = 0.57) still falls 0.17 short of the inter-human correlation (r = 0.74), we conclude that it is high enough to warrant deployment of the system in a low-stakes practice environment, given its coverage of several important aspects of communicative competence such as fluency, vocabulary diversity, grammar, and pronunciation.
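The paper does not publish SpeechRater's actual feature set or regression weights, but the general form of the scoring model described above can be sketched as ordinary least squares over recognizer-derived features. The feature names and toy data below are hypothetical stand-ins, purely for illustration:

```python
# Sketch of a multiple regression scoring model: human holistic scores are
# regressed on features computed from speech recognizer output, and the
# fitted model predicts a proficiency score for a new item response.
# All feature names and numbers here are invented for illustration.
import numpy as np

# Toy training data: one row per item response,
# columns = [words_per_second, unique_word_ratio, pause_fraction]
X = np.array([
    [2.1, 0.55, 0.10],
    [1.4, 0.40, 0.25],
    [2.6, 0.62, 0.08],
    [1.0, 0.35, 0.30],
    [1.9, 0.50, 0.15],
])
y = np.array([3.0, 2.0, 4.0, 1.0, 3.0])  # human scores on a 1-4 scale

# Fit ordinary least squares with an intercept term.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(features):
    """Predict a proficiency score for one response's feature vector."""
    return float(np.dot(np.append(features, 1.0), coef))

score = predict([2.0, 0.52, 0.12])
```

In an operational setting the prediction would typically be clipped or rounded to the reporting scale; the sketch omits that step.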
Deployment in a low-stakes practice environment is further warranted because this system is the initial version of a long-term research and development program: features related to vocabulary, grammar, and content will be added at a later stage, when automatic speech recognition performance improves, without requiring a re-design of the system. Exact agreement on single TPO items between our system and human raters was 57.8%, essentially on par with inter-human agreement of 57.2%. Our system has been in operational use to score TOEFL Practice Online Speaking tests since the Fall of 2006 and has since scored tens of thousands of tests.
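The two evaluation statistics reported above, Pearson r between machine and human scores and exact agreement on single items, can be computed as follows. The score lists are made-up illustrations, not data from the paper:

```python
# Sketch of the evaluation statistics used in the paper: Pearson
# product-moment correlation and exact-agreement rate between machine
# and human scores. The scores below are invented for illustration.
import numpy as np

human   = np.array([3, 2, 4, 1, 3, 2, 4, 3])
machine = np.array([3, 2, 3, 1, 3, 3, 4, 3])

# Pearson r: off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(human, machine)[0, 1]

# Exact agreement: fraction of items where the two scores match exactly
exact_agreement = float(np.mean(human == machine))
```

Exact agreement is the natural statistic for single items scored on a small integer scale, while Pearson r is reported for the continuous total scores over complete six-item sets.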