Massively parallel feature selection: an approach based on variance preservation

Authors:
Zheng Zhao; James Cox; David Duling; Warren Sarle
Affiliations:
SAS Institute Inc., Cary, NC (all authors)
Venue:
ECML PKDD'12 Proceedings of the 2012 European conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I
Year:
2012

Abstract

Advances in computer technologies have enabled corporations to accumulate data at an unprecedented speed. Large-scale business data might contain billions of observations and thousands of features, which easily brings its size to the terabyte level. Most traditional feature selection algorithms are designed for a centralized computing architecture, and their usability deteriorates significantly when data size exceeds hundreds of gigabytes. High-performance distributed computing frameworks and protocols, such as the Message Passing Interface (MPI) and MapReduce, have been proposed to facilitate software development on grid infrastructures, enabling analysts to process large-scale problems efficiently. This paper presents a novel large-scale feature selection algorithm that is based on variance analysis. The algorithm selects features by evaluating their ability to explain data variance. It supports both supervised and unsupervised feature selection and can be readily implemented in most distributed computing environments. The algorithm was developed as a SAS High-Performance Analytics procedure, which can read data in distributed form and perform parallel feature selection in both symmetric multiprocessing mode and massively parallel processing mode. Experimental results demonstrated the superior performance of the proposed method for large-scale feature selection.
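
The abstract does not spell out the selection criterion or the communication pattern, so the following is only a minimal sketch of the general idea, not the SAS procedure itself. It assumes the unsupervised case, assumes that each worker reduces its data partition to additive sufficient statistics (row count, column sums, cross-product matrix), and assumes a greedy forward search that repeatedly picks the feature explaining the most remaining variance across all features. The function names `partial_stats`, `combine`, `covariance`, and `greedy_select` are hypothetical, and `functools.reduce` stands in for whatever reduction the cluster framework would provide.

```python
from functools import reduce


def partial_stats(rows):
    """Map step: count, column sums, and X^T X for one data partition."""
    d = len(rows[0])
    sums = [0.0] * d
    cross = [[0.0] * d for _ in range(d)]
    for row in rows:
        for i in range(d):
            sums[i] += row[i]
            for j in range(d):
                cross[i][j] += row[i] * row[j]
    return len(rows), sums, cross


def combine(a, b):
    """Reduce step: statistics from two partitions add element-wise."""
    n_a, s_a, c_a = a
    n_b, s_b, c_b = b
    sums = [x + y for x, y in zip(s_a, s_b)]
    cross = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(c_a, c_b)]
    return n_a + n_b, sums, cross


def covariance(stats):
    """Sample covariance matrix recovered from the combined statistics."""
    n, sums, cross = stats
    d = len(sums)
    return [[(cross[i][j] - sums[i] * sums[j] / n) / (n - 1) for j in range(d)]
            for i in range(d)]


def greedy_select(cov, k, eps=1e-9):
    """Pick up to k features, each maximizing the variance it explains.

    Returns a list of (feature_index, explained_variance) pairs. Selection
    stops early once no remaining feature has residual variance above eps.
    """
    c = [row[:] for row in cov]  # residual covariance, updated in place
    d = len(c)
    picked = []
    for _ in range(k):
        best, best_gain = None, eps
        for j in range(d):
            if c[j][j] <= eps:  # already selected or fully explained
                continue
            # Variance of all features explained by regressing on feature j.
            gain = sum(c[i][j] ** 2 for i in range(d)) / c[j][j]
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            break
        # Remove the part of every feature explained by the chosen one
        # (Schur complement of the residual covariance).
        col = [c[i][best] for i in range(d)]
        pivot = col[best]
        for i in range(d):
            for j in range(d):
                c[i][j] -= col[i] * col[j] / pivot
        picked.append((best, best_gain))
    return picked


# Two partitions of a toy data set; the third column is twice the first.
part_a = [[1.0, 1.0, 2.0], [2.0, 0.0, 4.0]]
part_b = [[3.0, 0.0, 6.0], [4.0, 1.0, 8.0]]
stats = reduce(combine, [partial_stats(part_a), partial_stats(part_b)])
selected = greedy_select(covariance(stats), k=3)
```

In this sketch only the small statistics tuples cross partition boundaries, never the raw rows, which is what would make the approach fit MPI-style or MapReduce-style environments. On the toy data the redundant third column adds nothing once the first is chosen, so the search stops after two features even though `k=3` was requested.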