Discriminative features for text document classification

Authors:
K. Torkkola
Affiliations:
Motorola Labs, USA
Venue:
Pattern Analysis & Applications
Year:
2003

Citing 0
Cited 4

Text Mining with an Augmented Version of the Bisecting K-Means Algorithm
ICONIP '09 Proceedings of the 16th International Conference on Neural Information Processing: Part II

Geometrically local embedding in manifolds for dimension reduction
Pattern Recognition

Efficient feature selection filters for high-dimensional data
Pattern Recognition Letters

Equivalence Between LDA/QR and Direct LDA
International Journal of Cognitive Informatics and Natural Intelligence

Quantified Score

Hi-index	0.03

Abstract

The bag-of-words approach to text document representation typically yields document vectors with on the order of 5000–20,000 components. To make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in class discrimination of two popular dimension reduction methods: Latent Semantic Indexing (LSI) and sequential feature selection according to some relevant criterion. As a remedy, we suggest feature transforms based on Linear Discriminant Analysis (LDA). Since LDA requires operating with matrices that are both large and dense, we propose an efficient intermediate dimension reduction step using either a random transform or LSI. We report good classification results with the combined feature transform on a subset of the Reuters-21578 database. Drastic reduction of the feature vector dimensionality from 5000 to 12 actually improves the classification performance.
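
The two-stage transform described in the abstract (a cheap intermediate reduction followed by LDA) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic term-count corpus, the sizes (500 terms reduced to 50 and then to 2), the Gaussian random matrix, the small regularization of the within-class scatter, the nearest-centroid classifier, and all function names (`make_toy_corpus`, `random_projection_matrix`, `fit_lda`, `run_demo`).

```python
import numpy as np


def make_toy_corpus(rng, n_per_class=100, n_terms=500, n_classes=3, block=50):
    """Hypothetical term-count matrix: each class over-uses its own block of terms."""
    X, y = [], []
    for c in range(n_classes):
        rates = np.full(n_terms, 0.1)              # background term rate
        rates[c * block:(c + 1) * block] = 1.0     # class-specific vocabulary
        X.append(rng.poisson(rates, size=(n_per_class, n_terms)))
        y.append(np.full(n_per_class, c))
    return np.vstack(X).astype(float), np.concatenate(y)


def random_projection_matrix(rng, n_terms, k):
    """Data-independent Gaussian matrix mapping n_terms dimensions to k."""
    return rng.standard_normal((n_terms, k)) / np.sqrt(k)


def fit_lda(Z, y, n_components, reg=1e-3):
    """Return a (d x n_components) matrix of LDA directions for rows of Z."""
    classes = np.unique(y)
    d = Z.shape[1]
    mean_all = Z.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        Sw += (Zc - mc).T @ (Zc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Zc) * (diff @ diff.T)
    # Assumption: a small ridge keeps Sw positive definite for the Cholesky step.
    Sw += reg * np.trace(Sw) / d * np.eye(d)
    # Solve Sb w = lambda Sw w by whitening with the Cholesky factor of Sw.
    L = np.linalg.cholesky(Sw)
    Linv = np.linalg.inv(L)
    M = Linv @ Sb @ Linv.T
    M = (M + M.T) / 2.0                            # remove round-off asymmetry
    evals, evecs = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1][:n_components]  # largest eigenvalues first
    return Linv.T @ evecs[:, order]


def run_demo(seed=0, k=50):
    rng = np.random.default_rng(seed)
    X, y = make_toy_corpus(rng)
    idx = rng.permutation(len(y))
    train, test = idx[:200], idx[200:]

    # Stage 1: intermediate reduction with a random transform (500 -> k).
    Z = X @ random_projection_matrix(rng, X.shape[1], k)

    # Stage 2: LDA fitted on training documents only (k -> n_classes - 1).
    classes = np.unique(y)
    W = fit_lda(Z[train], y[train], n_components=len(classes) - 1)
    F = Z @ W

    # Nearest-centroid classification in the final low-dimensional space.
    centroids = np.stack([F[train][y[train] == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(F[test][:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == y[test]).mean()), F.shape


if __name__ == "__main__":
    acc, shape = run_demo()
    print(f"reduced shape: {shape}, held-out accuracy: {acc:.3f}")
```

The random matrix is drawn without looking at the data, so the first stage costs only one matrix product. The scatter matrices that LDA must factor are then k x k rather than n_terms x n_terms, which is the motivation the abstract gives for the intermediate step. LDA yields at most one fewer direction than the number of classes, so the final dimensionality in this toy setting is 2.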