Do NLP and machine learning improve traditional readability formulas?

Authors:
Thomas François;Eleni Miltsakaki
Affiliations:
University of Pennsylvania, Philadelphia, PA;University of Pennsylvania & Choosito!, Philadelphia, PA
Venue:
PITR '12 Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations
Year:
2012


Abstract

Readability formulas are methods used to match texts with readers' reading levels. Several methodological paradigms have previously been investigated in the field. The most popular paradigm dates back several decades and gave rise to well-known readability formulas such as the Flesch formula (among several others). This paper compares this approach (henceforth "classic") with an emerging paradigm that uses sophisticated NLP-enabled features and machine learning techniques. Our experiments, carried out on a corpus of texts for French as a foreign language, yield four main results: (1) the new readability formula performed better than the "classic" formula; (2) "non-classic" features were slightly more informative than "classic" features; (3) modern machine learning algorithms did not improve the explanatory power of our readability model, but allowed new observations to be classified more accurately; and (4) combining "classic" and "non-classic" features resulted in a significant gain in performance.
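To make the "classic" paradigm concrete, the following is a minimal sketch of the Flesch Reading Ease score in its standard English form (206.835 - 1.015 x words per sentence - 84.6 x syllables per word). It is an illustration only, not the formula or the French-language features evaluated in the paper. The function names `count_syllables` and `flesch_reading_ease` are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed here for simplicity.

```python
import re


def count_syllables(word: str) -> int:
    """Rough English heuristic (an assumption): one syllable per vowel group."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Standard English Flesch Reading Ease; higher scores mean easier text."""
    # Split on sentence-final punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    easy = "The cat sat on the mat."
    hard = "Readability assessment necessitates sophisticated computational methodologies."
    print(flesch_reading_ease(easy))
    print(flesch_reading_ease(hard))
```

Only two surface variables enter the score, sentence length and word length in syllables, which is what distinguishes this family of formulas from models built on richer NLP-derived features.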