CPCQ: Contrast pattern based clustering quality index for categorical data

  • Authors: Qingbao Liu; Guozhu Dong

  • Affiliations: College of Information System & Management, National University of Defense Technology, Changsha, Hunan 410073, China; Department of Computer Science & Engineering, Wright State University, Dayton, OH 45435, USA

  • Venue: Pattern Recognition
  • Year: 2012

Abstract

Clustering validation is concerned with assessing the quality of clustering solutions. Since clustering is unsupervised and highly exploratory, clustering validation has been an important and long-standing research problem. Existing validity measures, including entropy-based and distance-based indices, have significant shortcomings. Indeed, for many datasets from the UCI repository, they fail to recognize that the expert-determined classes are the best clusters, and they frequently give preference to clusterings with a larger number of clusters. This weakness reflects their inability to accurately capture intra-cluster coherence and inter-cluster separation. This paper proposes a novel Contrast Pattern based Clustering Quality index (CPCQ) for categorical data, which utilizes the quality and diversity of the contrast patterns that contrast the clusters in a given clustering. High-quality contrast patterns can serve to characterize the clusters and to discriminate one cluster from the others. The CPCQ index is based on the rationale that a high-quality clustering should have many diversified high-quality contrast patterns among its clusters. The quality of an individual contrast pattern is defined in terms of its length, its support, and the length of its corresponding closed pattern. The quality measure concerning "many diversified" contrast patterns is defined in terms of the quality and diversity of some selected groups of contrast patterns, with minimal overlap among contrast patterns and groups in terms of items and matching transactions. Experiments show that the CPCQ index (1) does not require a user to provide a distance function; (2) does not give inappropriate preference to a larger number of clusters; and (3) can recognize that expert-determined classes are the best clusters for many datasets from the UCI repository.
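The ingredients the abstract names (pattern length, support, and the corresponding closed pattern) can be illustrated with a minimal Python sketch. It does not reproduce the CPCQ formula, which the abstract does not state. The function names, the toy data, and the thresholds `min_home` and `max_other` are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Tuple

Item = Tuple[str, str]            # (attribute, value) pair of a categorical record
Transaction = FrozenSet[Item]     # one record, viewed as a set of items
Cluster = List[Transaction]


def support(pattern: FrozenSet[Item], cluster: Cluster) -> float:
    """Fraction of transactions in `cluster` that contain every item of `pattern`."""
    if not cluster:
        return 0.0
    return sum(pattern <= t for t in cluster) / len(cluster)


def closure(pattern: FrozenSet[Item], cluster: Cluster) -> FrozenSet[Item]:
    """Closed pattern of `pattern`: the items shared by all matching transactions."""
    matching = [t for t in cluster if pattern <= t]
    if not matching:
        return frozenset(pattern)
    return frozenset.intersection(*matching)


def is_contrast_pattern(pattern: FrozenSet[Item], home: Cluster,
                        others: List[Cluster],
                        min_home: float = 0.5,    # assumed threshold
                        max_other: float = 0.0    # assumed threshold
                        ) -> bool:
    """True if `pattern` is frequent in `home` and (nearly) absent elsewhere."""
    return (support(pattern, home) >= min_home
            and all(support(pattern, c) <= max_other for c in others))


# Hypothetical toy clustering with two clusters.
cluster_a: Cluster = [
    frozenset({("color", "red"), ("shape", "round"), ("size", "small")}),
    frozenset({("color", "red"), ("shape", "round"), ("size", "large")}),
    frozenset({("color", "red"), ("shape", "square"), ("size", "small")}),
]
cluster_b: Cluster = [
    frozenset({("color", "blue"), ("shape", "round"), ("size", "small")}),
    frozenset({("color", "blue"), ("shape", "square"), ("size", "large")}),
]

red = frozenset({("color", "red")})
shape_round = frozenset({("shape", "round")})

print(len(red), support(red, cluster_a), support(red, cluster_b))
print(is_contrast_pattern(red, cluster_a, [cluster_b]))
print(is_contrast_pattern(shape_round, cluster_a, [cluster_b]))
print(sorted(closure(shape_round, cluster_a)))
```

In this toy data, `red` matches every transaction of `cluster_a` and none of `cluster_b`, so it separates the two clusters, whereas `shape_round` occurs in both and does not. The closure of `shape_round` within `cluster_a` is longer than the pattern itself, which is the kind of length difference the abstract says feeds into pattern quality.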