Fast and accurate text classification via multiple linear discriminant projections

Authors:
Soumen Chakrabarti;Shourya Roy;Mahesh V. Soundalgekar
Affiliations:
IIT, Bombay;IIT, Bombay;IIT, Bombay
Venue:
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Year:
2002

Citing 12
Cited 10

The Johnson-Lindenstrauss Lemma and the sphericity of some graphs

Journal of Combinatorial Theory Series A
A comparison of classifiers and document representations for the routing problem

SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
Training algorithms for linear text classifiers

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Two algorithms for nearest-neighbor search in high dimensions

STOC '97 Proceedings of the twenty-ninth annual ACM symposium on Theory of computing
Inductive learning algorithms and representations for text categorization

Proceedings of the seventh international conference on Information and knowledge management
Making large-scale support vector machine learning practical

Advances in kernel methods
Data mining: practical machine learning tools and techniques with Java implementations

Data mining: practical machine learning tools and techniques with Java implementations
Proximal support vector machine classifiers

Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Two Variations on Fisher's Linear Discriminant for Pattern Recognition

IEEE Transactions on Pattern Analysis and Machine Intelligence
On the Relationship Between the Support Vector Machine for Classification and Sparsified Fisher's Linear Discriminant

Neural Processing Letters
Text Categorization with Support Vector Machines: Learning with Many Relevant Features

ECML '98 Proceedings of the 10th European Conference on Machine Learning
SPRINT: A Scalable Parallel Classifier for Data Mining

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases

Efficient multi-way text categorization via generalized discriminant analysis

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Index construction for linear categorisation

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Feature selection using linear classifier weights: interaction with classification models

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Developing practical automatic metadata assignment and evaluation tools for internet resources

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
IDR/QR: An Incremental Dimension Reduction Algorithm via QR Decomposition

IEEE Transactions on Knowledge and Data Engineering
Hierarchical document classification using automatically generated hierarchy

Journal of Intelligent Information Systems
An integrated system for building enterprise taxonomies

Information Retrieval
Discriminant Subspace Analysis: A Fukunaga-Koontz Approach

IEEE Transactions on Pattern Analysis and Machine Intelligence
Text categorization via generalized discriminant analysis

Information Processing and Management: an International Journal
Fuzzy integral to speed up support vector machines training for pattern classification

International Journal of Knowledge-based and Intelligent Engineering Systems

Abstract

Support vector machines (SVMs) have shown superb performance for text classification tasks. They are accurate, robust, and quick to apply to test instances. Their only potential drawback is their training time and memory requirement. For n training instances held in memory, the best-known SVM implementations take time proportional to n^a, where a is typically between 1.8 and 2.1. SVMs have been trained on data sets with several thousand instances, but Web directories today contain millions of instances, which are valuable for mapping billions of Web pages into Yahoo!-like directories. We present SIMPL, a nearly linear-time classification algorithm which mimics the strengths of SVMs while avoiding the training bottleneck. It uses Fisher's linear discriminant, a classical tool from statistical pattern recognition, to project training instances to a carefully selected low-dimensional subspace before inducing a decision tree on the projected instances. SIMPL uses efficient sequential scans and sorts, and is comparable in speed and memory scalability to widely-used naive Bayes (NB) classifiers, but it beats NB accuracy decisively. It not only approaches and sometimes exceeds SVM accuracy, but also beats SVM running time by orders of magnitude. While developing SIMPL, we also make a detailed experimental analysis of the cache performance of SVMs.
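
The project-then-split idea in the abstract can be illustrated with a small sketch. This is not the SIMPL algorithm itself: the abstract does not give its details, and SIMPL uses multiple discriminant projections followed by a decision tree. The sketch below assumes a two-class problem, computes a single Fisher discriminant direction, and replaces the tree with a one-level threshold (a decision stump). All function names, the toy data, and the small `ridge` term are hypothetical choices made for illustration.

```python
# Illustrative sketch only: one Fisher projection plus a decision stump.
# Names, toy data, and the ridge term are assumptions, not taken from the paper.

def mean(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter(rows, mu):
    """Within-class scatter matrix: sum of outer products of centred rows."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [x - m for x, m in zip(r, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fisher_direction(neg, pos, ridge=1e-6):
    """Direction w solving (S_w + ridge * I) w = mu_pos - mu_neg."""
    mu0, mu1 = mean(neg), mean(pos)
    s0, s1 = scatter(neg, mu0), scatter(pos, mu1)
    d = len(mu0)
    sw = [[s0[i][j] + s1[i][j] + (ridge if i == j else 0.0)
           for j in range(d)] for i in range(d)]
    return solve(sw, [b - a for a, b in zip(mu0, mu1)])

def project(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def best_threshold(neg_proj, pos_proj):
    """Stump on the projected values: predict positive when value > threshold."""
    values = sorted(set(neg_proj + pos_proj))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    def errors(t):
        return sum(p > t for p in neg_proj) + sum(p <= t for p in pos_proj)
    return min(candidates, key=errors)

# Toy two-dimensional data standing in for document feature vectors.
neg = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
pos = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.0]]

w = fisher_direction(neg, pos)
threshold = best_threshold([project(w, x) for x in neg],
                           [project(w, x) for x in pos])

def predict(x):
    return project(w, x) > threshold
```

Once the direction is known, each training instance reduces to a single number, so the split can be chosen by sorting the projected values. This loosely mirrors the sequential scans and sorts the abstract mentions, although the real system operates on sparse, high-dimensional text vectors rather than dense toy points.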