Relevant attribute discovery in high dimensional data based on rough sets and unsupervised classification: application to leukemia gene expressions

Authors:
Julio J. Valdés;Alan J. Barton
Affiliations:
National Research Council Canada, Ottawa, ON;National Research Council Canada, Ottawa, ON
Venue:
RSFDGrC'05 Proceedings of the 10th international conference on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing - Volume Part II
Year:
2005

Abstract

A pipelined approach using two clustering algorithms in combination with rough sets is investigated for the purpose of discovering important combinations of attributes in high-dimensional data. In many domains, data objects are described in terms of a large number of features, as in gene expression experiments or in samples characterized by spectral information. The Leader and several k-means algorithms are used as fast procedures for attribute set simplification of the information systems presented to the rough sets algorithms. The data submatrices described in terms of these features are then discretized with respect to the decision attribute according to different rough set based schemes. From them, the reducts and their derived rules are extracted and applied to test data in order to evaluate the resulting classification accuracy. An exploration of this approach (using leukemia gene expression data) was conducted in a series of experiments within a high-throughput distributed-computing environment. These experiments led to subsets of genes with high discrimination power. Good results were obtained with no preprocessing applied to the data.
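
The attribute set simplification step mentioned in the abstract can be illustrated with a minimal sketch of a Leader-style clustering applied to attributes (columns) rather than to objects. This is not the authors' implementation: the similarity measure (absolute Pearson correlation), the first-match assignment rule, the threshold value, the function names `pearson` and `leader_clustering`, and the toy data are all assumptions chosen for illustration. The leaders returned would play the role of the reduced attribute set passed on to the later discretization and reduct stages.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return cov / (norm_u * norm_v)


def leader_clustering(attributes, threshold):
    """Single-pass Leader clustering over attribute columns.

    Each attribute joins the first existing leader whose similarity to it
    reaches `threshold`; otherwise it becomes a new leader.
    Similarity here is |Pearson correlation| (an assumption of this sketch).
    Returns (leader indices, cluster index assigned to each attribute).
    """
    leaders = []
    assignment = []
    for idx, column in enumerate(attributes):
        for cluster, leader_idx in enumerate(leaders):
            if abs(pearson(column, attributes[leader_idx])) >= threshold:
                assignment.append(cluster)
                break
        else:
            # No leader was similar enough: this attribute starts a new cluster.
            leaders.append(idx)
            assignment.append(len(leaders) - 1)
    return leaders, assignment


# Hypothetical toy data: four "genes" measured on four samples.
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],  # nearly proportional to gene 0
    [4.0, 1.0, 3.0, 2.0],
    [8.0, 2.1, 6.0, 4.0],  # nearly proportional to gene 2
]

leaders, assignment = leader_clustering(genes, threshold=0.9)
print(leaders, assignment)
```

Because the procedure makes a single pass over the attributes, its result depends on their presentation order and on the threshold; that trade-off is what makes it fast enough to serve as a pre-filter before the more expensive rough set computations.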