Discretizing continuous attributes in AdaBoost for text categorization

  • Authors: Pio Nardiello; Fabrizio Sebastiani; Alessandro Sperduti

  • Affiliations: MercurioWeb SNC, Muro Lucano, PZ, Italy; Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy; Dipartimento di Matematica Pura ed Applicata, Università di Padova, Padova, Italy

  • Venue: ECIR'03 Proceedings of the 25th European conference on IR research
  • Year: 2003

Abstract

We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification: ADABOOST.MH and ADABOOST.MHKR. While the former is a realization of the well-known ADABOOST algorithm specifically aimed at multilabel text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem with both algorithms is that they require documents to be represented by binary vectors, indicating the presence or absence of each term in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper we address the problem of exploiting the potential of weighted representations in the context of ADABOOST-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.
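
As an illustration of the general idea of entropy-based discretization, the following minimal Python sketch picks a single cut point for one continuous term weight by minimizing the weighted class entropy of the two resulting intervals, then maps the weights to a binary attribute. The abstract does not say which entropy-based methods the paper uses, so this is an assumption-laden sketch and not the authors' procedure; the function names (`entropy`, `best_cut`, `binarize`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Find one threshold on a continuous attribute by entropy minimization.

    Returns (threshold, weighted_entropy). If no cut lowers the entropy
    of the unsplit labels, the threshold is None.
    Assumption: a single binary cut, with no stopping criterion.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_t, best_h = None, entropy(labels)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # a cut is only possible between distinct values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        h = (i / n) * entropy(left) + ((n - i) / n) * entropy(right)
        if h < best_h:
            best_t, best_h = (lo + hi) / 2, h
    return best_t, best_h


def binarize(values, threshold):
    """Map continuous weights to 0/1 according to the chosen threshold."""
    return [1 if v > threshold else 0 for v in values]


# Toy data (hypothetical): weights of one term in six documents,
# and whether each document belongs to a given category.
weights = [0.0, 0.1, 0.2, 0.7, 0.8, 0.9]
in_category = [0, 0, 0, 1, 1, 1]

threshold, h = best_cut(weights, in_category)
binary_attribute = binarize(weights, threshold)
```

On this toy data the chosen cut falls between 0.2 and 0.7, which separates the two classes perfectly. The resulting binary attribute could then be fed to a learner that expects binary vectors, which is the kind of input the abstract says both algorithms require.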