Discretizing continuous attributes in AdaBoost for text categorization

Authors:
Pio Nardiello;Fabrizio Sebastiani;Alessandro Sperduti
Affiliations:
MercurioWeb SNC, Muro Lucano, PZ, Italy;Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy;Dipartimento di Matematica Pura ed Applicata, Università di Padova, Padova, Italy
Venue:
ECIR'03 Proceedings of the 25th European conference on IR research
Year:
2003

Citing 11
Cited 8

Term-weighting approaches in automatic text retrieval

Information Processing and Management: an International Journal
Very Simple Classification Rules Perform Well on Most Commonly Used Datasets

Machine Learning
Evaluating and optimizing autonomous text classification systems

SIGIR '95 Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval
A decision-theoretic generalization of on-line learning and an application to boosting

Journal of Computer and System Sciences - Special issue: 26th annual ACM symposium on the theory of computing & STOC'94, May 23–25, 1994, and second annual European conference on computational learning theory (EuroCOLT'95), March 13–15, 1995
Improved Boosting Algorithms Using Confidence-rated Predictions

Machine Learning - The Eleventh Annual Conference on Computational Learning Theory
BoosTexter: A Boosting-based System for Text Categorization

Machine Learning - Special issue on information retrieval
An improved boosting algorithm and its application to text categorization

Proceedings of the ninth international conference on Information and knowledge management
Machine learning in automated text categorization

ACM Computing Surveys (CSUR)
A Comparative Study on Feature Selection in Text Categorization

ICML '97 Proceedings of the Fourteenth International Conference on Machine Learning
Improved use of continuous attributes in C4.5

Journal of Artificial Intelligence Research
ChiMerge: discretization of numeric attributes

AAAI'92 Proceedings of the tenth national conference on Artificial intelligence

Adding numbers to text classification

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Distributional term representations: an experimental comparison

Proceedings of the thirteenth ACM international conference on Information and knowledge management
An analysis of the relative hardness of Reuters-21578 subsets: Research Articles

Journal of the American Society for Information Science and Technology
Automatic expansion of domain-specific lexicons by term categorization

ACM Transactions on Speech and Language Processing (TSLP)
Combining rough decisions for intelligent text mining using Dempster's rule

Artificial Intelligence Review
Text classification: a recent overview

ICCOMP'05 Proceedings of the 9th WSEAS International Conference on Computers
Text categorization using an ensemble classifier based on a mean co-association matrix

MLDM'12 Proceedings of the 8th international conference on Machine Learning and Data Mining in Pattern Recognition
An Embedded Co-AdaBoost based construction of software document relation coupled resource spaces for cyber-physical society

Future Generation Computer Systems

Abstract

We focus on two recently proposed algorithms in the family of "boosting"-based learners for automated text classification, ADABOOST.MH and ADABOOST.MHKR. While the former is a realization of the well-known ADABOOST algorithm specifically aimed at multilabel text categorization, the latter is a generalization of the former based on the idea of learning a committee of classifier sub-committees. Both algorithms have been among the best performers in text categorization experiments so far. A problem in the use of both algorithms is that they require documents to be represented by binary vectors, indicating presence or absence of the terms in the document. As a consequence, these algorithms cannot take full advantage of the "weighted" representations (consisting of vectors of continuous attributes) that are customary in information retrieval tasks, and that provide a much more significant rendition of the document's content than binary representations. In this paper, we address the problem of exploiting the potential of weighted representations in the context of ADABOOST-like algorithms by discretizing the continuous attributes through the application of entropy-based discretization methods. We present experimental results on the Reuters-21578 text categorization collection, showing that for both algorithms the version with discretized continuous attributes outperforms the version with traditional binary representations.
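
As a rough illustration of what entropy-based discretization of a continuous attribute can look like, the sketch below picks a single cut point for one term weight by maximizing information gain with respect to a binary category label, then maps the weights to a binary feature. This is a generic textbook-style procedure, not the specific method evaluated in the paper; the function names (`entropy`, `best_cut_point`, `binarize`) and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut_point(values, labels):
    """Return (threshold, information_gain) for the best single binary split.

    Candidate thresholds are midpoints between consecutive distinct values.
    Returns (None, 0.0) when no split improves on the unsplit entropy.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        split = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = base - split
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2, gain
    return best_t, best_gain


def binarize(values, threshold):
    """Map continuous weights to 0/1 according to the chosen threshold."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical weights of one term in five documents, with category labels.
weights = [0.0, 0.1, 0.2, 0.8, 0.9]
in_category = [0, 0, 0, 1, 1]
threshold, gain = best_cut_point(weights, in_category)
binary_feature = binarize(weights, threshold)
```

A binary feature produced this way could be fed to a learner that only accepts presence/absence inputs. Applying the split recursively to each resulting interval, with a stopping criterion, would yield more than two intervals per attribute.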