Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values

  • Authors: Zhexue Huang

  • Affiliations: ACSys CRC, CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601, Australia. huang@mip.com.au

  • Venue: Data Mining and Knowledge Discovery
  • Year: 1998

Abstract

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
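
The three ingredients named in the abstract (simple matching dissimilarity, modes in place of means, and a frequency-based mode update) can be illustrated with a minimal Python sketch. This is not the paper's implementation. It is a simplified batch variant under stated assumptions: modes are recomputed once per pass rather than after each allocation, initial modes are drawn at random from the distinct objects, and all function names and the weight parameter `gamma` in the combined measure are labels chosen here for illustration.

```python
from collections import Counter
import random


def matching_dissim(a, b):
    """Simple matching: number of attributes on which two objects differ."""
    return sum(x != y for x, y in zip(a, b))


def mixed_dissim(num_a, num_b, cat_a, cat_b, gamma):
    """Combined measure for mixed data (k-prototypes style sketch):
    squared Euclidean distance on the numeric attributes plus
    gamma times the simple matching count on the categorical ones.
    gamma is an assumed weighting parameter."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    return numeric + gamma * matching_dissim(cat_a, cat_b)


def column_modes(rows):
    """Frequency-based update: most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, k, max_iter=100, seed=0):
    """Batch k-modes sketch. data is a list of equal-length tuples."""
    rng = random.Random(seed)
    # Assumption: initial modes are k distinct objects chosen at random.
    distinct = list(dict.fromkeys(data))
    modes = rng.sample(distinct, k)
    labels = None
    for _ in range(max_iter):
        # Allocate each object to the cluster with the nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_dissim(row, modes[j]))
            for row in data
        ]
        if new_labels == labels:  # no object changed cluster: stop
            break
        labels = new_labels
        # Recompute each cluster's mode from its current members.
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    # Clustering cost: total dissimilarity of objects to their modes.
    cost = sum(matching_dissim(row, modes[lab]) for row, lab in zip(data, labels))
    return labels, modes, cost
```

Calling `k_modes(data, k)` on a list of categorical tuples returns the cluster labels, the final modes, and the total cost. Replacing `matching_dissim` with `mixed_dissim`, and the mode update with a mean for numeric attributes, gives the general shape of a k-prototypes variant.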