We present results for automated text categorization of the Reuters-810000 collection of news stories. Our experiments use the entire one-year collection of 810,000 stories and the entire subject index. We divide the data into monthly groups and provide an initial benchmark of text categorization performance on the complete collection. Experimental results show that efficient sparse-feature implementations of linear methods and decision trees, using a global unstemmed dictionary, can readily handle applications of this size. Predictive performance is approximately as strong as the best results reported for the much smaller, older Reuters collections. Detailed results are provided for individual time periods. We show that a smaller time horizon does not appreciably diminish predictive quality, implying reduced retraining demands when the sample size is large.
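To make the setup concrete, here is a minimal sketch of what a sparse-feature linear classifier over a global unstemmed dictionary can look like. The whitespace tokenizer, the perceptron update rule, and the toy examples are illustrative assumptions; the abstract does not specify the authors' actual feature pipeline or training algorithm.

```python
from collections import defaultdict

def tokenize(text):
    # "Unstemmed" dictionary: lowercase whitespace tokens, no stemming
    # or other normalization.
    return text.lower().split()

def build_dictionary(docs):
    # Global dictionary mapping each term to a feature index.
    vocab = {}
    for doc in docs:
        for tok in tokenize(doc):
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def sparse_vector(doc, vocab):
    # Sparse representation: only nonzero term counts are stored,
    # so per-document cost is proportional to document length,
    # not dictionary size.
    vec = defaultdict(int)
    for tok in tokenize(doc):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

def train_perceptron(examples, vocab, epochs=10):
    # One linear model per category (binary labels in {+1, -1}).
    # The weight vector is dense, but each update touches only a
    # document's nonzero features -- the key to handling large corpora.
    w = [0.0] * len(vocab)
    b = 0.0
    for _ in range(epochs):
        for doc, label in examples:
            x = sparse_vector(doc, vocab)
            score = b + sum(w[i] * c for i, c in x.items())
            if label * score <= 0:  # misclassified or on the boundary
                for i, c in x.items():
                    w[i] += label * c
                b += label
    return w, b

def predict(doc, vocab, w, b):
    x = sparse_vector(doc, vocab)
    return 1 if b + sum(w[i] * c for i, c in x.items()) > 0 else -1
```

In a multi-label setting like the Reuters subject index, one such binary model would be trained per category against the shared global dictionary.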