Clustering algorithms for categorical data

  • Authors:
  • William Andreopoulos

  • Affiliations:
  • York University (Canada)

  • Venue:
  • Clustering algorithms for categorical data
  • Year:
  • 2006

Abstract

Categorical datasets in many domains, such as biology or software analysis, have a rich underlying cluster structure. Categorical clustering methods that are motivated by uncovering interesting local cluster structure could produce high clustering quality and potentially help analysts study hidden roles of objects in a dataset.

This thesis presents several clustering algorithms for categorical data. First, we introduce the HIERDENC algorithm for hierarchical density-based clustering of categorical data. Then, we present the MULIC algorithm, which is a faster simplification of HIERDENC. MULIC is designed for categorical datasets with a multi-layered structure, such as protein interaction data. Our experimental evaluation of MULIC on such datasets shows that it can uncover their underlying structure better than other algorithms and has comparable runtimes.

Next, we present the MULICsoft algorithm for clustering large software systems. MULICsoft is an extension of MULIC that incorporates information on a software system's runtime execution into the clustering process. We evaluate MULICsoft on a large open-source system. MULICsoft is able to produce decompositions that are close to those created by system experts.

We continue with the BILCOM algorithm, which is an extension of MULIC. BILCOM is used for clustering mixed categorical and numerical biomedical data. We apply BILCOM to datasets of mixed types, such as hepatitis, thyroid disease and yeast gene expression data with Gene Ontology annotations. The results show that BILCOM can partition these datasets significantly better than clustering on the categorical or numerical attributes alone.

Finally, we present the M-BILCOM algorithm, which is an extension of BILCOM for clustering mixed numerical and low-quality categorical data. M-BILCOM incorporates the confidence in the correctness of the categorical values into the clustering process. We apply M-BILCOM to yeast gene expression data with Gene Ontology annotations and GO Evidence codes representing evidence on the annotations' correctness.