Automated text categorization based on readability fingerprints

Authors:
Mark J. Embrechts;Jonathan Linton;Walter F. Bogaerts;Bram Heyns;Paul Evangelista
Affiliations:
DSES, Rensselaer Polytechnic Institute, Troy, NY;Telfer School of Management, University of Ottawa, Canada; ;Science and Engineering Library, University of Leuven, Leuven, Belgium;U.S. Military Academy, West Point, NY
Venue:
ICANN'07 Proceedings of the 17th international conference on Artificial neural networks
Year:
2007

Abstract

This paper introduces the use of 15 different readability indices as a fingerprint for classifying documents into categories. Classification based on such fingerprints alone is not necessarily superior to document categorization based on dedicated dictionaries, but the fingerprints can improve the overall classification rate when proper data fusion techniques are applied. For other text mining applications, such as language classification, plagiarism detection, or author identification, the accuracy of text categorization methods based on readability fingerprints can even exceed that of a dictionary-based approach. A novel extension of the readability indices is the addition of histograms based on the word length of all the dictionary words used in the text and a dictionary of the most common easy words in the English language.
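
To make the idea of a readability fingerprint concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not name the 15 indices, so two well-known character-based indices (the Automated Readability Index and the Coleman-Liau index) stand in for them here. The function name `fingerprint`, the tiny `EASY_WORDS` set, and the `max_len` cutoff are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for a dictionary of common easy English words;
# the list actually used in the paper is not given in this record.
EASY_WORDS = {"the", "a", "and", "is", "on", "cat", "sat", "mat", "it", "was"}


def fingerprint(text, max_len=10):
    """Return [ARI, Coleman-Liau, easy-word fraction] + word-length histogram."""
    # Naive sentence and word splitting; real tokenization is more involved.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    n_words = len(words)
    n_sents = len(sentences)
    n_letters = sum(len(w) for w in words)

    # Automated Readability Index (letters only here; digits are ignored).
    ari = 4.71 * n_letters / n_words + 0.5 * n_words / n_sents - 21.43

    # Coleman-Liau index: letters and sentences per 100 words.
    cli = (0.0588 * (100 * n_letters / n_words)
           - 0.296 * (100 * n_sents / n_words) - 15.8)

    # Fraction of words found in the (assumed) easy-word dictionary.
    easy = sum(w.lower() in EASY_WORDS for w in words) / n_words

    # Normalized histogram of word lengths 1..max_len; longer words share the last bin.
    counts = Counter(min(len(w), max_len) for w in words)
    hist = [counts.get(k, 0) / n_words for k in range(1, max_len + 1)]

    return [ari, cli, easy] + hist


fp = fingerprint("The cat sat on the mat. It was happy.")
print(len(fp))  # 13 features: 2 indices, 1 easy-word fraction, 10 histogram bins
```

A vector like this could be passed to any standard classifier, or fused with dictionary-based features, which is the role the abstract assigns to the fingerprint.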