Discriminative Features for Document Classification

Authors:
Kari Torkkola
Affiliations:
-
Venue:
ICPR '02 Proceedings of the 16th International Conference on Pattern Recognition (ICPR'02) Volume 1
Year:
2002

Citing 0
Cited 11

A new approach to conceptual document indexing: building a hierarchical system of concepts based on document clusters

ISICT '03 Proceedings of the 1st international symposium on Information and communication technologies
Scoring and Selecting Terms for Text Categorization

IEEE Intelligent Systems
Searching for topics in a large collection of texts

ACLstudent '04 Proceedings of the ACL 2004 workshop on Student research
Information Discriminant Analysis: Feature Extraction with an Information-Theoretic Objective

IEEE Transactions on Pattern Analysis and Machine Intelligence
Approximate information discriminant analysis: A computationally simple heteroscedastic feature extraction technique

Pattern Recognition
Approximate information discriminant analysis: A computationally simple heteroscedastic feature extraction technique

Pattern Recognition
Text classification: a recent overview

ICCOMP'05 Proceedings of the 9th WSEAS International Conference on Computers
Using Intuitionistic Fuzzy Sets in Text Categorization

ICAISC '08 Proceedings of the 9th international conference on Artificial Intelligence and Soft Computing
An efficient discriminant-based solution for small sample size problem

Pattern Recognition
Clustering Documents Using a Wikipedia-Based Concept Representation

PAKDD '09 Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Hierarchical classification of web documents by stratified discriminant analysis

IRFC'12 Proceedings of the 5th conference on Multidisciplinary Information Retrieval

Abstract

Document representation using the bag-of-words approach may require reducing the dimensionality of the representation in order to make effective use of various statistical classification methods. Latent Semantic Indexing (LSI) is one such method that is based on eigendecomposition of the covariance of the document-term matrix. Another often-used approach is to select a small number of the most important features out of the whole set according to some relevant criterion. This paper points out that LSI ignores discrimination while concentrating on representation. Furthermore, selection methods fail to produce a feature set that jointly optimizes class discrimination. As a remedy, we suggest supervised linear discriminative transforms, and report good classification results applying these to the Reuters-21578 database.
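
The contrast drawn in the abstract, between a transform that preserves variance and one that preserves class separation, can be illustrated with a minimal sketch. It assumes Fisher linear discriminant analysis as one example of a supervised linear discriminative transform; the abstract does not say which transforms the paper uses, so this is not presented as the paper's method. The data is a synthetic four-term count matrix, not Reuters-21578, and the helper names (lsi_transform, lda_transform, fisher_ratio) and the regularization constant are hypothetical.

```python
import numpy as np


def lsi_transform(X, k):
    """Unsupervised projection: top-k eigenvectors of the term covariance."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered matrix are the covariance eigenvectors.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T  # shape (n_terms, k)


def lda_transform(X, y, k, reg=1e-3):
    """Supervised projection: Fisher LDA via whitening of the within-class scatter."""
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw = reg * np.eye(d)  # small ridge term (an assumption) keeps Sw invertible
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # Solve the generalized eigenproblem Sb w = lambda Sw w symmetrically.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    evals, U = np.linalg.eigh(L_inv @ Sb @ L_inv.T)
    order = np.argsort(evals)[::-1][:k]
    return L_inv.T @ U[:, order]  # shape (n_terms, k)


def fisher_ratio(z, y):
    """Two-class separation of a 1-D projection: squared mean gap over summed variances."""
    z0, z1 = z[y == 0], z[y == 1]
    return (z1.mean() - z0.mean()) ** 2 / (z0.var() + z1.var())


# Synthetic counts: terms 0 and 1 are frequent but class-independent,
# terms 2 and 3 are rare but shift with the class label.
rng = np.random.default_rng(0)
y = np.arange(200) % 2
X = np.column_stack([
    rng.poisson(20, size=200),
    rng.poisson(15, size=200),
    rng.poisson(1 + 2 * y),
    rng.poisson(3 - 2 * y),
]).astype(float)

W_lsi = lsi_transform(X, k=1)
W_lda = lda_transform(X, y, k=1)
fisher_lsi = fisher_ratio((X @ W_lsi)[:, 0], y)
fisher_lda = fisher_ratio((X @ W_lda)[:, 0], y)
print("Fisher ratio, LSI direction:", fisher_lsi)
print("Fisher ratio, LDA direction:", fisher_lda)
```

In this toy setting the leading LSI direction follows the high-variance, class-independent terms, while the LDA direction weights the low-variance terms that actually differ between classes. The comparison is made on the training data only and says nothing about generalization.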