Best subset feature selection for massive mixed-type problems

Authors:
Eugene Tuv;Alexander Borisov;Kari Torkkola
Affiliations:
Intel, Analysis and Control Technology, Chandler, AZ;Intel, Analysis and Control Technology, N.Novgorod, Russia;Motorola, Intelligent Systems Lab, Tempe, AZ
Venue:
IDEAL'06 Proceedings of the 7th international conference on Intelligent Data Engineering and Automated Learning
Year:
2006

Citing 3
Cited 3

The Random Subspace Method for Constructing Decision Forests

IEEE Transactions on Pattern Analysis and Machine Intelligence
Random Forests

Machine Learning
Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5

ICML '04 Proceedings of the twenty-first international conference on Machine learning

Constructing High Dimensional Feature Space for Time Series Classification

PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases
Automatically finding the control variables for complex system behavior

Automated Software Engineering
A distributed hebb neural network for network anomaly detection

ISPA'07 Proceedings of the 5th international conference on Parallel and Distributed Processing and Applications

Quantified Score

Hi-index	0.00

Visualization

Abstract

We address the problem of identifying a non-redundant subset of important variables. All modern feature selection approaches including filters, wrappers, and embedded methods experience problems in very general settings with massive mixed-type data, and with complex relationships between the inputs and the target. We propose an efficient ensemble-based approach measuring statistical independence between a target and a potentially very large number of inputs including any meaningful order of interactions between them, removing redundancies from the relevant ones, and finally ranking variables in the identified minimum feature set. Experiments with synthetic data illustrate the sensitivity and the selectivity of the method, whereas the scalability of the method is demonstrated with a real car sensor data base.