Information Gain is a well-known and empirically proven method for high-dimensional feature selection. We found that it and other existing methods failed to produce good results on an industrial text classification problem. On investigating the root cause, we found that a large class of feature scoring methods suffers from a pitfall: they can be blinded by a surplus of strongly predictive features for some classes, while largely ignoring the features needed to discriminate difficult classes. In this paper we demonstrate that this pitfall hurts performance even on a relatively uniform text classification task. Based on this understanding, we present solutions inspired by round-robin scheduling that avoid the pitfall without resorting to costly wrapper methods. Empirical evaluation on 19 datasets shows substantial improvements.
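The round-robin idea can be illustrated with a minimal sketch. It is not the paper's actual algorithm, only one plausible reading of "inspired by round-robin scheduling": each class ranks the features by its own one-vs-rest score, and the classes take turns nominating their best feature that has not yet been picked. The function names `round_robin_select` and `global_top_k`, the class labels, and the scores are all made up for illustration.

```python
from typing import Dict, List

# Hypothetical per-class feature scores (e.g. a one-vs-rest Information Gain).
# The numbers are invented for illustration: "easy" has a surplus of strong
# features, while "hard" has only one weakly predictive feature.
scores: Dict[str, Dict[str, float]] = {
    "easy": {"goal": 0.90, "match": 0.80, "team": 0.70, "tax": 0.10},
    "hard": {"goal": 0.05, "match": 0.02, "team": 0.01, "tax": 0.20},
}


def global_top_k(scores: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Baseline: rank features by their best score over all classes."""
    best: Dict[str, float] = {}
    for per_class in scores.values():
        for feature, value in per_class.items():
            best[feature] = max(value, best.get(feature, float("-inf")))
    return sorted(best, key=best.get, reverse=True)[:k]


def round_robin_select(scores: Dict[str, Dict[str, float]], k: int) -> List[str]:
    """Let each class in turn nominate its best not-yet-selected feature."""
    # Each class ranks the features by its own score, best first.
    ranked = {c: sorted(s, key=s.get, reverse=True) for c, s in scores.items()}
    cursor = {c: 0 for c in ranked}
    # Never ask for more features than exist.
    k = min(k, len({f for s in scores.values() for f in s}))
    selected: List[str] = []
    seen = set()
    while len(selected) < k:
        for c in ranked:
            # Skip features that another class has already nominated.
            while cursor[c] < len(ranked[c]) and ranked[c][cursor[c]] in seen:
                cursor[c] += 1
            if cursor[c] < len(ranked[c]):
                feature = ranked[c][cursor[c]]
                selected.append(feature)
                seen.add(feature)
                if len(selected) == k:
                    break
    return selected


if __name__ == "__main__":
    print(global_top_k(scores, 2))
    print(round_robin_select(scores, 2))
```

With these made-up scores, the global ranking spends both slots on features that are strong for the "easy" class, whereas the round-robin pass reserves one slot for "tax", the only feature that helps the "hard" class. The cost is a per-class sort of the scores, which is far cheaper than a wrapper search that retrains a classifier for each candidate subset.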