Bayesian classification (AutoClass): theory and results
Advances in knowledge discovery and data mining
Class prediction and discovery using gene expression data
RECOMB '00 Proceedings of the fourth annual international conference on Computational molecular biology
ROCK: a robust clustering algorithm for categorical attributes
Information Systems
Clustering Algorithms
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
Data Mining and Knowledge Discovery
Techniques of Cluster Algorithms in Data Mining
Data Mining and Knowledge Discovery
A survey of data mining and knowledge discovery software tools
ACM SIGKDD Explorations Newsletter
Gene-Ontology-based clustering of gene expression data
Bioinformatics
THEA: ontology-driven analysis of microarray data
Bioinformatics
Bi-level clustering of mixed categorical and numerical biomedical data
International Journal of Data Mining and Bioinformatics
Clustering the internet topology at the AS-level
SMO'05 Proceedings of the 5th WSEAS international conference on Simulation, modelling and optimization
We present M-BILCOM, an algorithm for clustering mixed numerical and categorical data sets in which the categorical attribute values (CAs) are not certain to be correct and each carries a confidence value (CV) between 0.0 and 1.0 representing its certainty of correctness. M-BILCOM performs bi-level clustering of mixed data sets in a manner resembling a Bayesian process. We applied M-BILCOM to yeast data sets in which the CAs were randomly perturbed and CVs were assigned to indicate confidence in the correctness of each CA; on such mixed data sets, M-BILCOM outperforms other clustering algorithms, such as AutoClass. We also applied M-BILCOM to real numerical data sets from gene expression studies on yeast, incorporating CAs that represent Gene Ontology annotations on the genes and CVs that represent the Gene Ontology Evidence Codes on those CAs. We apply novel significance metrics to the CAs in the resulting clusters to extract the most significant CAs based on their frequencies and their CVs within each cluster. For genomic data sets, we use the most significant CAs in a cluster to predict gene function.
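
To make the idea of ranking CAs by frequency and CV concrete, the following is a minimal Python sketch. It is not the significance metric defined in the paper, which the abstract does not specify. It assumes a simple stand-in score: the sum of a CA's CVs in a cluster divided by the cluster size. The function names, the data layout (each object as a list of `(ca, cv)` pairs), and the labels `term_A` and `term_B` are all hypothetical.

```python
from collections import defaultdict


def weighted_ca_scores(cluster):
    """Score each categorical attribute value (CA) in one cluster.

    `cluster` is a list of objects; each object is a list of (ca, cv)
    pairs, where cv is a confidence value in [0.0, 1.0].
    Assumed stand-in score (not the paper's metric): the sum of the CVs
    attached to a CA, divided by the number of objects in the cluster.
    """
    totals = defaultdict(float)
    for annotations in cluster:
        for ca, cv in annotations:
            if not 0.0 <= cv <= 1.0:
                raise ValueError(f"confidence value out of range: {cv}")
            totals[ca] += cv
    size = len(cluster)
    if size == 0:
        return {}
    return {ca: total / size for ca, total in totals.items()}


def most_significant(cluster, top=1):
    """Return the `top` highest-scoring (ca, score) pairs, ties broken by name."""
    scores = weighted_ca_scores(cluster)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]


# Hypothetical cluster of four objects with placeholder CA labels.
example_cluster = [
    [("term_A", 0.9)],
    [("term_A", 0.5), ("term_B", 1.0)],
    [("term_B", 0.2)],
    [("term_A", 1.0)],
]
top_cas = most_significant(example_cluster, top=2)
```

Under this assumed score, a CA that appears often but with low confidence can rank below a rarer CA with high confidence, which is the trade-off that a metric combining frequency and CV has to resolve.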