A data labeling method for clustering categorical data

  • Authors:
  • Fuyuan Cao; Jiye Liang

  • Affiliations:
  • School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, China and Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of E ...;School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, China and Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of E ...

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2011

Quantified Score

Hi-index 12.06

Abstract

As data volumes grow rapidly, clustering a very large data set inevitably becomes time-consuming. To improve efficiency, sampling is commonly used to reduce the size of the data set. With sampling, however, allocating the unlabeled objects (those outside the sample) to the proper clusters is a difficult problem. This paper proposes a novel similarity measure for clustering categorical data, based on the frequency of attribute values within a given cluster and the distribution of attribute values across clusters, that allocates each unlabeled object to the most appropriate cluster. A labeling algorithm for categorical data is also presented, and its time complexity is analyzed. Experiments on real-world data sets demonstrate the effectiveness of the proposed algorithm.
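The abstract does not give the similarity measure itself, so the following is only a minimal sketch of the general idea of labeling by attribute-value statistics, not the authors' formula. It assumes a sample has already been clustered, and it scores an unlabeled object against each cluster by combining two quantities per attribute: the relative frequency of the object's value inside the cluster, and the share of that value's occurrences (across all clusters) that fall in the cluster. The function names `build_cluster_stats`, `similarity`, and `label`, and the way the two quantities are multiplied and summed, are assumptions made for illustration.

```python
from collections import Counter


def build_cluster_stats(sample, labels):
    """Count attribute values per cluster from an already-clustered sample."""
    sizes = Counter(labels)  # number of sampled objects in each cluster
    counts = {}              # counts[cluster][attribute index] -> Counter of values
    for obj, cluster in zip(sample, labels):
        per_attr = counts.setdefault(cluster, [Counter() for _ in obj])
        for j, value in enumerate(obj):
            per_attr[j][value] += 1
    return sizes, counts


def similarity(obj, cluster, sizes, counts):
    """Illustrative score (an assumption, not the paper's measure)."""
    score = 0.0
    for j, value in enumerate(obj):
        in_cluster = counts[cluster][j][value]
        if in_cluster == 0:
            continue  # value never seen in this cluster for attribute j
        # Frequency of the value within this cluster.
        freq = in_cluster / sizes[cluster]
        # Share of the value's occurrences, over all clusters, found in this cluster.
        total = sum(counts[c][j][value] for c in counts)
        score += freq * (in_cluster / total)
    return score


def label(obj, sizes, counts):
    """Assign an unlabeled object to the cluster with the highest score."""
    return max(counts, key=lambda c: similarity(obj, c, sizes, counts))


# Toy sample: two categorical attributes, two clusters.
sample = [("red", "small"), ("red", "large"), ("blue", "large"), ("blue", "large")]
labels = [0, 0, 1, 1]
sizes, counts = build_cluster_stats(sample, labels)
print(label(("red", "large"), sizes, counts))
print(label(("blue", "small"), sizes, counts))
```

In this toy data the value "red" occurs only in cluster 0 and "blue" only in cluster 1, so those values dominate the scores; the value "large" appears in both clusters and therefore contributes less to either.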