Pruning The Vocabulary For Better Context Recognition

Authors:
Rasmus Elsborg Madsen;Sigurdur Sigurdsson;Lars Kai Hansen;Jan Larsen
Affiliations:
Technical University of Denmark;Technical University of Denmark;Technical University of Denmark;Technical University of Denmark
Venue:
ICPR '04: Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), Volume 2
Year:
2004

Citing 0
Cited 3

Text classification: a recent overview
ICCOMP'05 Proceedings of the 9th WSEAS International Conference on Computers

Predicting credit card customer churn in banks using data mining
International Journal of Data Analysis Techniques and Strategies

MuZeeker: adapting a music search engine for mobile phones
Mobile Multimedia Processing


Abstract

Language-independent 'bag-of-words' representations are surprisingly effective for text classification. The representation is high dimensional, however, and contains many words that are non-consistent for text categorization. These non-consistent words reduce the generalization performance of subsequent classifiers, e.g., through ill-posed principal component transformations. In this communication our aim is to study the effect of removing the least relevant words from the bag-of-words representation. We consider a new approach that uses neural network based sensitivity maps and information gain to determine term relevancy when pruning the vocabularies. With reduced vocabularies, documents are classified using a latent semantic indexing representation and a probabilistic neural network classifier. Reducing the bag-of-words vocabularies by 90%-98%, we find consistent classification improvement on two mid-size data sets. We also study the applicability of information gain and sensitivity maps for automated keyword generation.
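
As a rough illustration of the information-gain side of the pruning described in the abstract, the sketch below scores each term by how much knowing its presence or absence reduces class entropy, then keeps only the top-scoring fraction of the vocabulary. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`entropy`, `information_gain`, `prune_vocabulary`), the binary present/absent treatment of terms, the `keep_fraction` parameter, and the toy documents are all hypothetical, and the neural network sensitivity maps, latent semantic indexing step, and classifier from the paper are not shown.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels):
    """Map each term to IG(term) = H(class) - H(class | term present/absent).

    Assumption: each term is a binary feature (present or absent in a document).
    """
    n = len(docs)
    class_counts = Counter(labels)
    h_class = entropy(class_counts.values())
    doc_sets = [set(d) for d in docs]
    vocab = set().union(*doc_sets)
    gains = {}
    for term in vocab:
        # Class counts among documents that contain the term.
        present = Counter(lab for s, lab in zip(doc_sets, labels) if term in s)
        # Class counts among documents that do not contain the term.
        absent = class_counts - present
        n_present = sum(present.values())
        h_cond = (n_present / n) * entropy(present.values())
        h_cond += ((n - n_present) / n) * entropy(absent.values())
        gains[term] = h_class - h_cond
    return gains


def prune_vocabulary(docs, labels, keep_fraction=0.1):
    """Keep the highest-IG fraction of terms (hypothetical helper)."""
    gains = information_gain(docs, labels)
    n_keep = max(1, round(keep_fraction * len(gains)))
    # Sort by descending gain; break ties alphabetically for determinism.
    ranked = sorted(gains, key=lambda t: (-gains[t], t))
    return set(ranked[:n_keep])


# Toy corpus (made up for illustration): tokenized documents and class labels.
toy_docs = [
    ["goal", "match", "the"],
    ["match", "team", "the"],
    ["stock", "market", "the"],
    ["market", "price", "the"],
]
toy_labels = ["sport", "sport", "finance", "finance"]

toy_gains = information_gain(toy_docs, toy_labels)
toy_kept = prune_vocabulary(toy_docs, toy_labels, keep_fraction=0.3)
```

In this toy corpus a word such as "the", which occurs in every document, gets zero information gain and is pruned, while words that occur in only one class score highest. Setting `keep_fraction` between 0.02 and 0.1 would correspond to the 90%-98% reduction range reported in the abstract.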