On the importance of parameter tuning in text categorization

Authors:
Cornelis H. A. Koster;Jean G. Beney
Affiliations:
Dept. Comp. Sci., University of Nijmegen, The Netherlands;Dept. Informatique, INSA de Lyon, France
Venue:
PSI'06 Proceedings of the 6th international Andrei Ershov memorial conference on Perspectives of systems informatics
Year:
2006

Abstract

Text Categorization algorithms have a large number of parameters that determine their behaviour. The effect of these parameters is not easily predicted, either objectively or intuitively, and may well depend on the corpus or on the document representation. Parameter values are usually taken over from previously published results, which may lead to less than optimal accuracy on a particular corpus. In this paper we investigate the effect of parameter tuning on the accuracy of two Text Categorization algorithms: the well-known Rocchio algorithm and the lesser-known Winnow. We show that the optimal parameter values for a specific corpus are sometimes very different from those found in the literature, that the effect of individual parameters is corpus-dependent, and that parameter tuning can greatly improve the accuracy of both Winnow and Rocchio. We argue that the dependence of categorization algorithms on experimentally established parameter values makes it hard to compare the outcomes of different experiments, and we propose the automatic determination of optimal parameters on the training set as a solution.
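
To make the proposal concrete, the following is a minimal sketch of what determining parameters on the training set alone might look like for a Rocchio-style classifier. It is not the authors' procedure. The function names, the grid values, the toy term-frequency vectors, the hold-out split, the cosine scoring and the clipping of negative weights are all assumptions chosen for illustration.

```python
from itertools import product


def rocchio_prototype(docs, labels, beta, gamma):
    """Class prototype: beta * mean(positives) - gamma * mean(negatives).

    Negative weights are clipped to zero (an assumed, commonly used variant).
    """
    dim = len(docs[0])
    pos = [d for d, y in zip(docs, labels) if y]
    neg = [d for d, y in zip(docs, labels) if not y]
    proto = []
    for j in range(dim):
        p = sum(d[j] for d in pos) / len(pos) if pos else 0.0
        n = sum(d[j] for d in neg) / len(neg) if neg else 0.0
        proto.append(max(0.0, beta * p - gamma * n))
    return proto


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


def accuracy(proto, threshold, docs, labels):
    """Fraction of documents whose thresholded score matches the label."""
    hits = sum((cosine(proto, d) >= threshold) == bool(y)
               for d, y in zip(docs, labels))
    return hits / len(docs)


def tune_rocchio(train_docs, train_labels, betas, gammas, thresholds, holdout=0.25):
    """Grid search that only ever looks at the training set.

    The last `holdout` fraction of the training data is used for validation;
    the test set is never touched. Returns (accuracy, beta, gamma, threshold).
    """
    cut = int(len(train_docs) * (1 - holdout))
    fit_docs, fit_labels = train_docs[:cut], train_labels[:cut]
    val_docs, val_labels = train_docs[cut:], train_labels[cut:]
    best = None
    for beta, gamma, thr in product(betas, gammas, thresholds):
        proto = rocchio_prototype(fit_docs, fit_labels, beta, gamma)
        acc = accuracy(proto, thr, val_docs, val_labels)
        if best is None or acc > best[0]:
            best = (acc, beta, gamma, thr)
    return best


# Hypothetical toy corpus: term-frequency vectors over a 3-term vocabulary,
# with positive and negative documents interleaved.
DOCS = [[3, 0, 1], [0, 3, 1], [2, 0, 1], [1, 2, 1],
        [4, 1, 1], [0, 4, 2], [3, 1, 2], [1, 3, 1]]
LABELS = [True, False, True, False, True, False, True, False]

best = tune_rocchio(DOCS, LABELS,
                    betas=[1.0, 2.0],
                    gammas=[0.0, 0.5, 1.0],
                    thresholds=[0.3, 0.5, 0.7])
print("best (accuracy, beta, gamma, threshold):", best)
```

On a real corpus one would presumably replace the single hold-out split with cross-validation and the accuracy measure with whatever the experiment reports; the point of the sketch is only that the selected values come from the training data rather than from earlier publications.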