A pitfall and solution in multi-class feature selection for text classification

Authors:
George Forman
Affiliations:
Hewlett-Packard Labs, Palo Alto, CA
Venue:
ICML '04 Proceedings of the twenty-first international conference on Machine learning
Year:
2004


Abstract

Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we found that a large class of feature scoring methods suffers from a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring features needed to discriminate difficult classes. In this paper we demonstrate that this pitfall hurts performance even for a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid this pitfall, without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
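
The pitfall and the round-robin idea can be illustrated with a small sketch. It assumes that a score has already been computed for every feature with respect to every class (for example, by scoring each class against the rest). The function names, the toy scores, and the max-over-classes baseline are illustrative assumptions and are not taken from the paper.

```python
from typing import Dict, List

Scores = Dict[str, Dict[str, float]]  # class -> {feature: score}, higher is better


def global_top_k(scores_by_class: Scores, k: int) -> List[str]:
    """Baseline (assumed for illustration): rank features by their best score over all classes."""
    best: Dict[str, float] = {}
    for scores in scores_by_class.values():
        for feature, value in scores.items():
            best[feature] = max(best.get(feature, float("-inf")), value)
    return sorted(best, key=lambda f: (-best[f], f))[:k]


def round_robin_select(scores_by_class: Scores, k: int) -> List[str]:
    """Let each class in turn nominate its best feature that is not yet selected."""
    # One ranking per class, best first; ties are broken by feature name.
    rankings = {
        cls: sorted(scores, key=lambda f: (-scores[f], f))
        for cls, scores in scores_by_class.items()
    }
    position = {cls: 0 for cls in rankings}
    selected: List[str] = []
    chosen = set()
    while len(selected) < k:
        progressed = False
        for cls in sorted(rankings):
            if len(selected) >= k:
                break
            ranking, i = rankings[cls], position[cls]
            # Skip features that another class has already nominated.
            while i < len(ranking) and ranking[i] in chosen:
                i += 1
            if i < len(ranking):
                selected.append(ranking[i])
                chosen.add(ranking[i])
                i += 1
                progressed = True
            position[cls] = i
        if not progressed:  # every ranking is exhausted
            break
    return selected


# Toy scores: the "easy" class has a surplus of strong features,
# while the "hard" class only has weak ones.
toy: Scores = {
    "easy": {"a1": 0.9, "a2": 0.8, "a3": 0.7, "h1": 0.1, "h2": 0.05},
    "hard": {"a1": 0.0, "a2": 0.0, "a3": 0.0, "h1": 0.3, "h2": 0.2},
}

print(global_top_k(toy, 3))        # ['a1', 'a2', 'a3']
print(round_robin_select(toy, 3))  # ['a1', 'h1', 'a2']
```

With a budget of three features, the global ranking in this toy example spends the entire budget on the easy class, whereas the round-robin pass guarantees that the hard class contributes its best feature as well.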