We combine the speed and scalability of information retrieval with the generally superior classification accuracy of machine learning, yielding a two-phase text classifier that scales to very large document corpora. We investigate how different methods of formulating the query from the training set, and different query sizes, affect the results. In empirical tests on the Reuters RCV1 corpus of 806,000 documents, runtime was easily reduced by a factor of 27, with a somewhat surprising gain in F-measure compared with traditional text classification.
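The two-phase idea can be illustrated with a minimal sketch. It is not the method evaluated in the paper: the term-scoring rule in `formulate_query`, the toy linear weights, and every function name and document below are assumptions made for illustration only. Phase one uses an inverted index to retrieve candidate documents that match a query built from the training set; phase two applies the slower classifier to those candidates alone.

```python
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in set(tokenize(text)):
            index[term].add(doc_id)
    return index


def formulate_query(positives, negatives, query_size):
    """Assumed scoring rule: keep the query_size terms whose document
    frequency is most skewed toward the positive training documents."""
    pos_df = Counter(t for d in positives for t in set(tokenize(d)))
    neg_df = Counter(t for d in negatives for t in set(tokenize(d)))

    def score(term):
        return pos_df[term] / len(positives) - neg_df[term] / len(negatives)

    # Ties are broken alphabetically so the result is deterministic.
    return sorted(pos_df, key=lambda t: (-score(t), t))[:query_size]


def retrieve(index, query):
    """Phase 1: union of the postings lists of the query terms."""
    candidates = set()
    for term in query:
        candidates |= index.get(term, set())
    return candidates


def two_phase_classify(docs, index, query, classifier):
    """Phase 2: run the classifier on the retrieved candidates only."""
    return sorted(d for d in retrieve(index, query) if classifier(docs[d]))


# Hypothetical training data and corpus.
positives = ["wheat prices rise", "wheat harvest grain export"]
negatives = ["stock market prices fall", "bank interest rates rise"]
corpus = [
    "grain export deal signed",
    "bank raises rates",
    "wheat futures climb",
    "grain of truth in market rumor",
]

# Stand-in for a trained model: a toy linear scorer with made-up weights.
weights = {"wheat": 1.0, "grain": 1.0, "export": 1.0, "market": -2.0}


def classifier(text):
    return sum(weights.get(t, 0.0) for t in tokenize(text)) > 0


index = build_index(corpus)
query = formulate_query(positives, negatives, query_size=3)
predicted = two_phase_classify(corpus, index, query, classifier)
print(query, predicted)
```

In this toy run the query never matches document 1, so the classifier is not invoked on it. Skipping documents in this way is where a runtime saving would come from, and the query size controls the trade-off between how many documents are skipped and how many true positives the retrieval phase might miss.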