Clustering based on compressed data for categorical and mixed attributes

  • Authors:
  • Erendira Rendón;José Salvador Sánchez

  • Affiliations:
  • Lab. Reconocimiento de Patrones, Instituto Tecnológico de Toluca, Metepec, Mexico;Dept. Llenguatges i Sistemes Informàtics, Universitat Jaume I, Castelló de la Plana, Spain

  • Venue:
  • SSPR'06/SPR'06 Proceedings of the 2006 joint IAPR international conference on Structural, Syntactic, and Statistical Pattern Recognition
  • Year:
  • 2006

Abstract

Clustering in data mining is a discovery process that groups a set of data so as to maximize intra-cluster similarity and minimize inter-cluster similarity. The task becomes more challenging when the data are categorical and the available memory is smaller than the data set. In this paper, we introduce CBC (Clustering Based on Compressed Data), an extension of the BIRCH algorithm that is especially suited to very large databases and can handle both categorical attributes and mixed features. The effectiveness and performance of CBC were compared with those of the well-known K-modes clustering algorithm. The comparison shows that the CBC summary process does not affect the final clustering, while execution times can be drastically reduced.
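
The abstract does not describe how CBC compresses the data, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a BIRCH-style additive summary adapted to categorical records: a record count plus per-attribute value frequencies, with the per-attribute mode as the representative and the simple matching dissimilarity commonly used with K-modes. The class name `CategoricalSummary` and its methods are hypothetical.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: number of attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


class CategoricalSummary:
    """Hypothetical additive summary of a group of categorical records.

    Assumption: the summary keeps only a record count and one frequency
    table per attribute, so the raw records need not stay in memory.
    """

    def __init__(self, n_attributes):
        self.n = 0
        self.counts = [Counter() for _ in range(n_attributes)]

    def add(self, record):
        """Absorb a single record into the summary."""
        self.n += 1
        for counter, value in zip(self.counts, record):
            counter[value] += 1

    def merge(self, other):
        """Combine two summaries by adding their frequency tables."""
        self.n += other.n
        for mine, theirs in zip(self.counts, other.counts):
            mine.update(theirs)

    def mode(self):
        """Representative record: the most frequent value per attribute."""
        return tuple(c.most_common(1)[0][0] for c in self.counts)


# Toy data, invented for illustration only.
first = CategoricalSummary(2)
for record in [("red", "small"), ("red", "large"), ("blue", "small")]:
    first.add(record)

second = CategoricalSummary(2)
for record in [("blue", "large"), ("blue", "large")]:
    second.add(record)

mode_before_merge = first.mode()
distance = matching_dissimilarity(mode_before_merge, ("blue", "small"))

first.merge(second)
mode_after_merge = first.mode()
```

Because the summary is additive, two groups can be merged without revisiting their records, which is the property that lets a BIRCH-like procedure work when the data set does not fit in memory.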