Clustering mixed numerical and low quality categorical data: significance metrics on a yeast example

  • Authors:
  • Bill Andreopoulos;Aijun An;Xiaogang Wang

  • Affiliations:
  • York University, Toronto, Canada;York University, Toronto, Canada;York University, Toronto, Canada

  • Venue:
  • Proceedings of the 2nd international workshop on Information quality in information systems
  • Year:
  • 2005

Abstract

We present the M-BILCOM algorithm for clustering mixed numerical and categorical data sets in which the categorical attribute values (CAs) are not certain to be correct and carry associated confidence values (CVs), from 0.0 to 1.0, that represent their certainty of correctness. M-BILCOM performs bi-level clustering of mixed data sets in a manner resembling a Bayesian process. We have applied M-BILCOM to yeast data sets in which the CAs were perturbed randomly and CVs were assigned to indicate the confidence of correctness of the CAs. On such mixed data sets, M-BILCOM outperforms other clustering algorithms, such as AutoClass. We have also applied M-BILCOM to real numerical data sets from gene expression studies on yeast, incorporating CAs that represent Gene Ontology annotations on the genes and CVs that represent Gene Ontology Evidence Codes on the CAs. We apply novel significance metrics to the CAs in the resulting clusters to extract the most significant CAs based on their frequencies and their CVs in the cluster. For genomic data sets, we use the most significant CAs in a cluster to predict gene function.
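
The abstract does not define the significance metrics, so the following is only a minimal sketch of the general idea of ranking the CAs in one cluster by both frequency and confidence. The scoring rule (the sum of a CA's CVs divided by the cluster size), the function name rank_categorical_values, and the example annotations and CVs are all assumptions made for illustration. They are not taken from the paper and are not M-BILCOM's actual metrics.

```python
from collections import defaultdict


def rank_categorical_values(cluster):
    """Rank the categorical attribute values (CAs) found in one cluster.

    `cluster` is a list of (ca, cv) pairs, one per annotated object, where
    `cv` is the confidence value in [0.0, 1.0] attached to that CA.

    ASSUMPTION: the score is the sum of a CA's CVs divided by the cluster
    size, i.e. a confidence-weighted relative frequency. This is a
    hypothetical stand-in, not the significance metric from the paper.
    """
    if not cluster:
        return []

    totals = defaultdict(float)
    for ca, cv in cluster:
        if not 0.0 <= cv <= 1.0:
            raise ValueError(f"confidence value out of range: {cv}")
        totals[ca] += cv

    size = len(cluster)
    scores = {ca: total / size for ca, total in totals.items()}
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Illustrative (made-up) cluster of six annotated genes.
example_cluster = [
    ("ribosome biogenesis", 0.9),
    ("ribosome biogenesis", 0.8),
    ("ribosome biogenesis", 0.3),
    ("DNA repair", 1.0),
    ("DNA repair", 0.2),
    ("transport", 0.1),
]

ranking = rank_categorical_values(example_cluster)
for ca, score in ranking:
    print(f"{ca}: {score:.3f}")
```

Under this assumed rule, a CA that is frequent but poorly supported can rank below a rarer CA with high-confidence annotations. The top-ranked CA of a cluster would then be the candidate used to suggest a function for unannotated genes in the same cluster.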