Do NLP and machine learning improve traditional readability formulas?

  • Authors: Thomas François; Eleni Miltsakaki
  • Affiliations: University of Pennsylvania, Philadelphia, PA; University of Pennsylvania & Choosito!, Philadelphia, PA
  • Venue: PITR '12: Proceedings of the First Workshop on Predicting and Improving Text Readability for target reader populations
  • Year: 2012

Abstract

Readability formulas are methods for matching texts to readers' reading levels. Several methodological paradigms have previously been investigated in the field. The most popular paradigm dates back several decades and gave rise to well-known readability formulas such as the Flesch formula (among several others). This paper compares this approach (henceforth "classic") with an emerging paradigm that uses sophisticated NLP-enabled features and machine learning techniques. Our experiments, carried out on a corpus of texts for French as a foreign language, yield four main results: (1) the new readability formula performed better than the "classic" formula; (2) "non-classic" features were slightly more informative than "classic" features; (3) modern machine learning algorithms did not improve the explanatory power of our readability model, but they did classify new observations more accurately; and (4) combining "classic" and "non-classic" features resulted in a significant gain in performance.
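
To make the "classic" paradigm concrete, the following is a minimal sketch of the standard Flesch Reading Ease formula for English, which combines average sentence length and average syllables per word in a fixed linear equation. It is only an illustration: it is not the French formula evaluated in the paper, the function name `flesch_reading_ease` is hypothetical, and the word, sentence, and syllable counts are assumed to be computed elsewhere.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch Reading Ease score (English version) from raw counts.

    Higher scores indicate easier texts. The counts are assumed to come
    from some external tokenizer and syllable counter (not shown here).
    """
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("n_words and n_sentences must be positive")
    avg_sentence_length = n_words / n_sentences      # words per sentence
    avg_syllables_per_word = n_syllables / n_words   # syllables per word
    # Published coefficients of the English Flesch Reading Ease formula.
    return 206.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


# Example: 100 words in 5 sentences with 150 syllables in total.
score = flesch_reading_ease(100, 5, 150)
print(round(score, 3))
```

A formula of this kind relies on two surface features with fixed weights, whereas the "non-classic" paradigm described in the abstract draws on NLP-derived features and learns the model from annotated data.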