MPI: The Complete Reference
Feature Selection for Knowledge Discovery and Data Mining
Feature Selection for Knowledge Discovery and Data Mining
Feature Selection for Clustering - A Filter Solution
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Minimum Redundancy Feature Selection from Microarray Gene Expression Data
CSB '03 Proceedings of the IEEE Computer Society Conference on Bioinformatics
Theoretical and Empirical Analysis of ReliefF and RReliefF
Machine Learning
An introduction to variable and feature selection
The Journal of Machine Learning Research
Use of the zero norm with linear models and kernel methods
The Journal of Machine Learning Research
Pattern Classification (2nd Edition)
Pattern Classification (2nd Edition)
Feature Selection for Unsupervised Learning
The Journal of Machine Learning Research
Grid computing for parallel bioinspired algorithms
Journal of Parallel and Distributed Computing - Special issue on parallel bioinspired algorithms
Supervised feature selection via dependence estimation
Proceedings of the 24th international conference on Machine learning
Least squares linear discriminant analysis
Proceedings of the 24th international conference on Machine learning
Spectral feature selection for supervised and unsupervised learning
Proceedings of the 24th international conference on Machine learning
Efficient Parallel Feature Selection for Steganography Problems
IWANN '09 Proceedings of the 10th International Work-Conference on Artificial Neural Networks: Part I: Bio-Inspired Systems: Computational and Ambient Intelligence
Trace ratio criterion for feature selection
AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 2
Hi-index | 0.00 |
Advances in computer technologies have enabled corporations to accumulate data at an unprecedented speed. Large-scale business data might contain billions of observations and thousands of features, which easily brings their scale to the level of terabytes. Most traditional feature selection algorithms are designed for a centralized computing architecture. Their usability significantly deteriorates when data size exceeds hundreds of gigabytes. High-performance distributed computing frameworks and protocols, such as the Message Passing Interface (MPI) and MapReduce, have been proposed to facilitate software development on grid infrastructures, enabling analysts to process large-scale problems efficiently. This paper presents a novel large-scale feature selection algorithm that is based on variance analysis. The algorithm selects features by evaluating their abilities to explain data variance. It supports both supervised and unsupervised feature selection and can be readily implemented in most distributed computing environments. The algorithm was developed as a SAS High-Performance Analytics procedure, which can read data in distributed form and perform parallel feature selection in both symmetric multiprocessing mode and massively parallel processing mode. Experimental results demonstrated the superior performance of the proposed method for large scale feature selection.