Automated text categorization based on readability fingerprints

  • Authors:
  • Mark J. Embrechts;Jonathan Linton;Walter F. Bogaerts;Bram Heyns;Paul Evangelista

  • Affiliations:
  • DSES, Rensselaer Polytechnic Institute, Troy, NY;Telfer School of Management, University of Ottawa, Canada; ;Science and Engineering Library, University of Leuven, Leuven, Belgium;Paul Evangelista, U.S. Military Academy, West Point, NY

  • Venue:
  • ICANN'07 Proceedings of the 17th international conference on Artificial neural networks
  • Year:
  • 2007

Abstract

This paper introduces the use of 15 different readability indices as a fingerprint that enables the classification of documents into different categories. While classification based on such fingerprints alone is not necessarily superior to document categorization based on dedicated dictionaries, the document fingerprints can improve the overall classification rate when proper data fusion techniques are applied. For other text mining applications, such as language classification, plagiarism detection, or author identification, the accuracy of text categorization methods based on readability fingerprints can even exceed that of a dictionary-based approach. A novel extension of the readability indices is the addition of histograms based on the word length of all the dictionary words used in the text and a dictionary of the most common easy words in the English language.
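
To make the idea of a readability fingerprint concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not list the 15 indices, so the sketch assumes two well-known ones, the Automated Readability Index and the Coleman-Liau index, and appends a normalized word-length histogram. The function name `readability_fingerprint`, the `max_len` parameter, and the simple regex tokenization are illustrative assumptions.

```python
import re
from collections import Counter


def readability_fingerprint(text, max_len=10):
    """Return a small feature vector: [ARI, Coleman-Liau, word-length histogram].

    Assumption: the choice of indices and the tokenization are illustrative;
    the paper's 15 indices are not specified in the abstract.
    """
    # Crude tokenization: words are runs of ASCII letters,
    # sentences are split on terminal punctuation.
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words, n_sents = len(words), len(sentences)
    if n_words == 0 or n_sents == 0:
        raise ValueError("text must contain at least one word and one sentence")

    n_letters = sum(len(w) for w in words)

    # Automated Readability Index (letters stand in for characters here).
    ari = 4.71 * n_letters / n_words + 0.5 * n_words / n_sents - 21.43

    # Coleman-Liau index: letters and sentences per 100 words.
    letters_per_100 = 100.0 * n_letters / n_words
    sents_per_100 = 100.0 * n_sents / n_words
    cli = 0.0588 * letters_per_100 - 0.296 * sents_per_100 - 15.8

    # Word-length histogram with bins 1..max_len; longer words share the last bin.
    counts = Counter(min(len(w), max_len) for w in words)
    hist = [counts.get(k, 0) / n_words for k in range(1, max_len + 1)]

    return [ari, cli] + hist


if __name__ == "__main__":
    sample = "The cat sat. The dog ran fast."
    print(readability_fingerprint(sample))
```

A vector of this kind could be computed per document and passed to any standard classifier, alone or alongside dictionary-based features. How the fingerprints are fused with dictionary-based features in the paper is not described in the abstract.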