Experimental results of the signal processing approach to distributional clustering of terms on Reuters-21578 collection

Authors:
Marta Capdevila Dalmau;Oscar W. Márquez Flórez
Affiliations:
University of Vigo, Telecommunication Engineering School, Signal and Communications Processing Dpt., Vigo, Spain;University of Vigo, Telecommunication Engineering School, Signal and Communications Processing Dpt., Vigo, Spain
Venue:
ECIR'07 Proceedings of the 29th European conference on IR research
Year:
2007

Citing 4

Distributional clustering of words for text classification. Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval.
Machine learning in automated text categorization. ACM Computing Surveys (CSUR).
A divisive information theoretic feature clustering algorithm for text classification. The Journal of Machine Learning Research.
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems).

Abstract

Distributional Clustering has been shown to be an effective and powerful approach to supervised term extraction, aimed at reducing the dimensionality of the original indexing space for Automatic Text Categorization [2]. In a recent paper [1] we introduced a new Signal Processing approach to Distributional Clustering, which reached categorization results on the 20 Newsgroups dataset similar to those obtained by other information-theoretic approaches [3][4][5]. Here we re-validate our method by showing that the 90-category Reuters-21578 benchmark collection can be indexed with only 50 clusters at a minimal loss of categorization accuracy (around 2% with a Naïve Bayes categorizer).
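
The general idea behind distributional clustering of terms can be illustrated with a minimal sketch. This is not the signal processing method of the paper, whose details the abstract does not give. It only shows, under simplifying assumptions, how terms might be grouped by the similarity of their class distributions P(class | term), so that a few clusters replace many individual terms as indexing features. The toy documents, the function names `term_class_distributions` and `cluster_terms`, the L1 distance, and the threshold value are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def term_class_distributions(docs):
    """Estimate P(label | term) from (tokens, label) pairs by simple counting."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens, label in docs:
        for tok in tokens:
            counts[tok][label] += 1
    dists = {}
    for term, by_label in counts.items():
        total = sum(by_label.values())
        dists[term] = {label: n / total for label, n in by_label.items()}
    return dists


def cluster_terms(dists, labels, threshold=0.25):
    """Greedy single-pass grouping (an assumed stand-in, not the paper's method).

    A term joins the first cluster whose seed distribution lies within
    `threshold` in L1 distance; otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed_vector, member_terms)
    for term in sorted(dists):
        vec = [dists[term].get(label, 0.0) for label in labels]
        for seed, members in clusters:
            if sum(abs(a - b) for a, b in zip(seed, vec)) <= threshold:
                members.append(term)
                break
        else:
            clusters.append((vec, [term]))
    return [members for _, members in clusters]


# Toy labelled documents (invented for illustration only).
docs = [
    (["oil", "crude", "barrel"], "crude"),
    (["oil", "price", "barrel"], "crude"),
    (["wheat", "grain", "price"], "grain"),
    (["wheat", "grain", "harvest"], "grain"),
]
labels = ["crude", "grain"]

dists = term_class_distributions(docs)
clusters = cluster_terms(dists, labels)
print(clusters)
```

In this toy setting seven distinct terms collapse into three clusters, each of which could then serve as a single feature for a categorizer. Published distributional clustering methods use more principled similarity measures and clustering procedures than the greedy pass assumed here.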