Scaling up text classification for large file systems

Authors:
George Forman;Shyamsundar Rajaram
Affiliations:
Hewlett-Packard Labs, Palo Alto, CA, USA;Hewlett-Packard Labs, Palo Alto, CA, USA
Venue:
Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2008

Abstract

We combine the speed and scalability of information retrieval with the generally superior classification accuracy offered by machine learning, yielding a two-phase text classifier that can scale to very large document corpora. We investigate the effect of different methods of formulating the query from the training set, as well as the effect of varying the query size. In empirical tests on the Reuters RCV1 corpus of 806,000 documents, we find that runtime is easily reduced by a factor of 27, with a somewhat surprising gain in F-measure compared with traditional text classification.
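
The two-phase idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which query formulation methods or which classifier were used, so everything here is an assumption made for illustration. The query is formed from the terms whose document frequency is most skewed toward the positive training class, the retrieval phase is a plain in-memory inverted index, and the second phase is a toy smoothed log-odds scorer standing in for a real learned classifier. All function names (`formulate_query`, `retrieve`, `train_weights`, `two_phase_classify`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


def doc_frequencies(texts):
    """Count, for each term, how many of the given texts contain it."""
    return Counter(term for text in texts for term in set(tokenize(text)))


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(tokenize(text)):
            index[term].add(doc_id)
    return index


def formulate_query(train, query_size):
    """Pick the query_size terms most skewed toward the positive class.

    Assumed scoring: positive document-frequency rate minus negative rate.
    """
    pos = [text for text, label in train if label]
    neg = [text for text, label in train if not label]
    pos_df, neg_df = doc_frequencies(pos), doc_frequencies(neg)

    def score(term):
        return pos_df[term] / max(len(pos), 1) - neg_df[term] / max(len(neg), 1)

    # Ties are broken alphabetically so the query is deterministic.
    return sorted(pos_df, key=lambda t: (-score(t), t))[:query_size]


def retrieve(index, query):
    """Phase 1: union of the posting lists of the query terms."""
    candidates = set()
    for term in query:
        candidates |= index.get(term, set())
    return candidates


def train_weights(train):
    """Toy stand-in classifier: one smoothed log-odds weight per term."""
    pos = [text for text, label in train if label]
    neg = [text for text, label in train if not label]
    pos_df, neg_df = doc_frequencies(pos), doc_frequencies(neg)
    return {
        term: math.log((pos_df[term] + 1) / (len(pos) + 2))
        - math.log((neg_df[term] + 1) / (len(neg) + 2))
        for term in set(pos_df) | set(neg_df)
    }


def is_positive(weights, text):
    """Predict positive when the summed term weights exceed zero."""
    return sum(weights.get(term, 0.0) for term in set(tokenize(text))) > 0


def two_phase_classify(docs, index, train, query_size):
    """Phase 2 scores only the documents that phase 1 retrieved.

    Returns the predicted positives and the number of documents scored.
    """
    candidates = retrieve(index, formulate_query(train, query_size))
    weights = train_weights(train)
    positives = {d for d in candidates if is_positive(weights, docs[d])}
    return positives, len(candidates)
```

In this sketch the classifier never sees documents that match no query term, which is where a runtime saving would come from. The query size sets the trade-off: a larger query retrieves more candidates, so fewer true positives are missed but more documents must be scored.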