Iteratively Selecting Feature Subsets for Mining from High-Dimensional Databases

Authors:
Hiroshi Mamitsuka
Affiliations:
-
Venue:
PKDD '02 Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery
Year:
2002

Citing 11

Query by committee. COLT '92 Proceedings of the fifth annual workshop on Computational learning theory
C4.5: programs for machine learning
Wrappers for feature subset selection. Artificial Intelligence - Special issue on relevance
Attribute selection for modelling. Future Generation Computer Systems - Special double issue on data mining
Making large-scale support vector machine learning practical. Advances in kernel methods
Feature Selection for Knowledge Discovery and Data Mining
A Survey of Methods for Scaling Up Inductive Algorithms. Data Mining and Knowledge Discovery
Pasting Small Votes for Classification in Large Databases and On-Line. Machine Learning
Feature selection for high-dimensional genomic microarray data. ICML '01 Proceedings of the Eighteenth International Conference on Machine Learning
On Feature Selection: Learning with Exponentially Many Irrelevant Features as Training Examples. ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Efficient Mining from Large Databases by Query Learning. ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning

Cited 1

Artifacts of Markov blanket filtering based on discretized features in small sample size applications. Pattern Recognition Letters

Abstract

We propose a new data mining method that is effective for mining from extremely high-dimensional databases. The method iteratively selects a subset of features from a database and builds a hypothesis with that subset. Each selection of a feature subset has two steps: first, selecting the subset of instances in the database for which the predictions of the previously obtained hypotheses are most unreliable; and second, selecting the subset of features whose value distribution in the selected instances differs most from that in all instances of the database. We empirically evaluate the effectiveness of the proposed method by comparing its performance with that of two other methods, including that of Xing et al., one of the latest feature subset selection methods. The evaluation was performed on a real-world data set with approximately 140,000 features. Our results show that the proposed method outperforms the other methods, both in final predictive accuracy and in the precision attained at the recall given by Xing et al.'s method. We have also examined the effect of noise in the data and found that the advantage of the proposed method becomes more pronounced at higher noise levels.
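
The two-step selection loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not name the base learner, the unreliability measure, or the distribution-difference measure. Here a nearest-centroid classifier stands in for the learner, the committee's vote margin stands in for unreliability, and the absolute difference in per-feature means stands in for the distribution difference. All function names and parameters (`iterative_selection`, `n_instances`, `n_features`, and so on) are hypothetical.

```python
import numpy as np

def fit_centroids(X, y):
    # Stand-in base learner (an assumption): one mean vector per class, labels in {0, 1}.
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def predict_centroids(centroids, X):
    # Assign each instance to the class of the nearest centroid.
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def committee_votes(committee, X):
    # One row of 0/1 predictions per hypothesis, each using its own feature subset.
    return np.stack([predict_centroids(c, X[:, f]) for f, c in committee])

def select_features(X, committee, n_instances, n_features, rng):
    d = X.shape[1]
    if len(committee) < 2:
        # Assumption: with fewer than two hypotheses there is no disagreement
        # to measure, so start from random feature subsets.
        return rng.choice(d, size=n_features, replace=False)
    # Step 1: instances whose committee vote is closest to an even split.
    margin = np.abs(committee_votes(committee, X).mean(axis=0) - 0.5)
    uncertain = np.argsort(margin, kind="stable")[:n_instances]
    # Step 2: features whose mean on those instances shifts most from the global mean.
    shift = np.abs(X[uncertain].mean(axis=0) - X.mean(axis=0))
    return np.argsort(-shift, kind="stable")[:n_features]

def iterative_selection(X, y, rounds=5, n_instances=20, n_features=10, seed=0):
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(rounds):
        feats = select_features(X, committee, n_instances, n_features, rng)
        committee.append((feats, fit_centroids(X[:, feats], y)))
    return committee

def predict(committee, X):
    # Assumption: the final prediction is a majority vote over all hypotheses.
    return (committee_votes(committee, X).mean(axis=0) > 0.5).astype(int)
```

In this sketch each hypothesis only ever sees `n_features` columns, which is what keeps the per-round cost small even when the database has a very large number of features; the choice of scoring functions would need to follow the paper itself.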