On the importance of parameter tuning in text categorization

  • Authors: Cornelis H. A. Koster; Jean G. Beney

  • Affiliations: Dept. Comp. Sci., University of Nijmegen, The Netherlands; Dept. Informatique, INSA de Lyon, France

  • Venue: PSI'06 Proceedings of the 6th international Andrei Ershov memorial conference on Perspectives of systems informatics
  • Year: 2006

Abstract

Text Categorization algorithms have a large number of parameters that determine their behaviour. The effect of these parameters is not easily predicted, either objectively or intuitively, and may well depend on the corpus or on the document representation. Their values are usually adopted from previously published results, which may lead to less than optimal accuracy when experimenting on particular corpora. In this paper we investigate the effect of parameter tuning on the accuracy of two Text Categorization algorithms: the well-known Rocchio algorithm and the lesser-known Winnow. We show that the optimal parameter values for a specific corpus are sometimes very different from those found in the literature. We show that the effect of individual parameters is corpus-dependent, and that parameter tuning can greatly improve the accuracy of both Winnow and Rocchio. We argue that the dependence of the categorization algorithms on experimentally established parameter values makes it hard to compare the outcomes of different experiments, and we propose the automatic determination of optimal parameters on the training set as a solution.
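
The idea of determining parameters automatically from the training data can be illustrated with a minimal sketch. It is not the authors' procedure, which the abstract does not describe. It assumes a simplified Rocchio classifier with the conventional weights `beta` (positive centroid) and `gamma` (negative centroid), dense term vectors, a fixed decision threshold of zero, and a plain grid search scored on a held-out slice of the training set. All function names and the `holdout_every` parameter are hypothetical.

```python
from itertools import product


def centroid(vectors, dim):
    """Mean of a list of dense vectors (zero vector if the list is empty)."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def train_rocchio(docs, labels, beta, gamma):
    """Prototype = beta * positive centroid - gamma * negative centroid."""
    dim = len(docs[0])
    pos = centroid([d for d, y in zip(docs, labels) if y], dim)
    neg = centroid([d for d, y in zip(docs, labels) if not y], dim)
    return [beta * p - gamma * n for p, n in zip(pos, neg)]


def predict(prototype, doc):
    """Assign the category when the dot product with the prototype is positive."""
    return sum(w * x for w, x in zip(prototype, doc)) > 0


def accuracy(prototype, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    hits = sum(predict(prototype, d) == y for d, y in zip(docs, labels))
    return hits / len(docs)


def tune_on_train_set(docs, labels, betas, gammas, holdout_every=3):
    """Grid search that never touches test data.

    Every holdout_every-th training document is held out for scoring
    (assumption: a single split; cross-validation would be a natural variant).
    Returns (best_score, best_beta, best_gamma); ties keep the first candidate.
    """
    pairs = list(zip(docs, labels))
    fit = [p for i, p in enumerate(pairs) if i % holdout_every != 0]
    held = [p for i, p in enumerate(pairs) if i % holdout_every == 0]
    fit_docs, fit_labels = zip(*fit)
    held_docs, held_labels = zip(*held)

    best = None
    for beta, gamma in product(betas, gammas):
        proto = train_rocchio(fit_docs, fit_labels, beta, gamma)
        score = accuracy(proto, held_docs, held_labels)
        if best is None or score > best[0]:
            best = (score, beta, gamma)
    return best


if __name__ == "__main__":
    # Toy two-term corpus: positive documents use term 0, negative ones term 1.
    toy_docs = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]
    toy_labels = [True, False, True, False, True, False]
    print(tune_on_train_set(toy_docs, toy_labels, [0.5, 1.0], [0.0, 1.0]))
```

Because the selected values depend only on the training documents, the same search can be rerun for each corpus or document representation instead of reusing values from earlier publications.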