Efficient handling of high-dimensional feature spaces by randomized classifier ensembles

  • Authors:
  • Aleksander Kołcz; Xiaomei Sun; Jugal Kalita

  • Affiliations:
  • Personalogy, Inc., Colorado Springs, CO; University of Colorado at Colorado Springs, Colorado Springs, CO; University of Colorado at Colorado Springs, Colorado Springs, CO

  • Venue:
  • Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02)
  • Year:
  • 2002

Abstract

Handling massive datasets is difficult not only because of prohibitively large numbers of entries but, in some cases, also because of the very high dimensionality of the data. Often, aggressive feature selection is performed to limit the number of attributes to a manageable size, which unfortunately can discard useful information. Feature-space reduction may well be necessary for many stand-alone classifiers, but recent advances in ensemble classifier techniques indicate that accurate classifier aggregates can be learned even if each individual classifier operates on incomplete "feature view" training data, i.e., data from which certain input attributes are excluded. In fact, by using only small random subsets of features to build the individual component classifiers, surprisingly accurate and robust models can be created. In this work we demonstrate how such architectures effectively reduce the feature space seen by individual sub-models and groups of sub-models, which lends itself to efficient sequential and/or parallel implementations. Experiments with a randomized version of AdaBoost on a text classification task are used to support our arguments.
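
The sketch below illustrates the general idea the abstract describes: an AdaBoost-style ensemble in which each boosting round trains its weak learner on a small random subset of the features, so no component ever materializes the full feature space. It is a minimal illustration, not the authors' exact algorithm; names such as n_rounds, subset_size, and the choice of a decision stump as the weak learner are assumptions for the example.

    # Minimal sketch of boosting over random feature subsets (illustrative,
    # not the paper's exact method). Assumes labels y are in {-1, +1}.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def random_subspace_adaboost(X, y, n_rounds=50, subset_size=100, seed=0):
        rng = np.random.default_rng(seed)
        n_samples, n_features = X.shape
        w = np.full(n_samples, 1.0 / n_samples)   # per-example weights
        ensemble = []                             # (alpha, feature_idx, stump)
        for _ in range(n_rounds):
            # Each weak learner sees only a small random "feature view".
            idx = rng.choice(n_features, size=min(subset_size, n_features),
                             replace=False)
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X[:, idx], y, sample_weight=w)
            pred = stump.predict(X[:, idx])
            err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)  # standard AdaBoost weight
            w *= np.exp(-alpha * y * pred)         # up-weight mistakes
            w /= w.sum()
            ensemble.append((alpha, idx, stump))
        return ensemble

    def predict(ensemble, X):
        score = sum(a * s.predict(X[:, idx]) for a, idx, s in ensemble)
        return np.sign(score)

Because each round touches only subset_size columns, the rounds impose a small per-model memory footprint and can be distributed across workers, which is the efficiency argument the abstract makes; for high-dimensional sparse text data, sparse matrices and a linear weak learner would be the natural substitutions.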