Clustering mixed numerical and low quality categorical data: significance metrics on a yeast example

Authors:
Bill Andreopoulos;Aijun An;Xiaogang Wang
Affiliations:
York University, Toronto, Canada;York University, Toronto, Canada;York University, Toronto, Canada
Venue:
Proceedings of the 2nd international workshop on Information quality in information systems
Year:
2005

References (9):
Bayesian classification (AutoClass): theory and results. Advances in knowledge discovery and data mining
Class prediction and discovery using gene expression data. RECOMB '00 Proceedings of the fourth annual international conference on Computational molecular biology
ROCK: a robust clustering algorithm for categorical attributes. Information Systems
Clustering Algorithms
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery
Techniques of Cluster Algorithms in Data Mining. Data Mining and Knowledge Discovery
A survey of data mining and knowledge discovery software tools. ACM SIGKDD Explorations Newsletter
Gene-Ontology-based clustering of gene expression data. Bioinformatics
THEA: ontology-driven analysis of microarray data. Bioinformatics

Cited by (3):
Report from the First and Second International Workshops on Information Quality in Information Systems: IQIS 2004 and IQIS 2005 in conjunction with ACM SIGMOD/PODS Conferences. ACM SIGMOD Record
Bi-level clustering of mixed categorical and numerical biomedical data. International Journal of Data Mining and Bioinformatics
Clustering the internet topology at the AS-level. SMO'05 Proceedings of the 5th WSEAS international conference on Simulation, modelling and optimization

Abstract

We present the M-BILCOM algorithm for clustering mixed numerical and categorical data sets, in which the categorical attribute values (CAs) are not certain to be correct and have associated confidence values (CVs) from 0.0 to 1.0 to represent their certainty of correctness. M-BILCOM performs bi-level clustering of mixed data sets resembling a Bayesian process. We have applied M-BILCOM to yeast data sets in which the CAs were perturbed randomly and CVs were assigned indicating the confidence of correctness of the CAs. On such mixed data sets M-BILCOM outperforms other clustering algorithms, such as AutoClass. We have applied M-BILCOM to real numerical data sets from gene expression studies on yeast, incorporating CAs representing Gene Ontology annotations on the genes and CVs representing Gene Ontology Evidence Codes on the CAs. We apply novel significance metrics to the CAs in resulting clusters, to extract the most significant CAs based on their frequencies and their CVs in the cluster. For genomic data sets, we use the most significant CAs in a cluster to predict gene function.
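
The abstract says the significance metrics rank the CAs in a cluster by their frequencies and their CVs, but it does not give the formula. The Python sketch below is therefore not the authors' metric. It is a minimal illustration of one way frequency and confidence could be combined, under these assumptions: each object (gene) carries a list of (CA, CV) pairs, a CA is counted once per object using its highest CV, and the score of a CA is the sum of its CVs divided by the cluster size. The function name `rank_cluster_annotations` and the example annotations are hypothetical.

```python
from collections import defaultdict


def rank_cluster_annotations(cluster, top_n=3):
    """Rank categorical attribute values (CAs) within one cluster.

    cluster: list of objects, each a list of (ca, cv) pairs, where cv is a
             confidence value in [0.0, 1.0].
    Returns up to top_n (ca, score) pairs, highest score first.

    ASSUMPTION: score = (sum of CVs of the CA over the cluster) / cluster size.
    This is an illustrative confidence-weighted frequency, not the metric
    defined in the paper.
    """
    totals = defaultdict(float)
    for annotations in cluster:
        best_cv = {}
        for ca, cv in annotations:
            if not 0.0 <= cv <= 1.0:
                raise ValueError(f"confidence value out of range: {cv}")
            # Count each CA at most once per object, keeping its highest CV.
            best_cv[ca] = max(cv, best_cv.get(ca, 0.0))
        for ca, cv in best_cv.items():
            totals[ca] += cv

    size = len(cluster)
    scores = {ca: total / size for ca, total in totals.items()}
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical cluster of four genes with annotation-like CAs and their CVs.
example_cluster = [
    [("transport", 0.9), ("kinase", 0.2)],
    [("transport", 0.8)],
    [("kinase", 0.3), ("transport", 0.5)],
    [("ribosome", 1.0)],
]

for ca, score in rank_cluster_annotations(example_cluster):
    print(f"{ca}: {score:.3f}")
```

Under this scoring, a CA that is frequent but carries low confidence can rank below a rarer CA with high confidence, which is the kind of trade-off a frequency-and-CV-based metric has to resolve.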