Using Category-Based Adherence to Cluster Market-Basket Data

  • Authors:
  • Ching-Huang Yun;Kun-Ta Chuang;Ming-Syan Chen

  • Venue:
  • ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
  • Year:
  • 2002

Abstract

In this paper, we devise an efficient algorithm for clustering market-basket data. Different from those of the traditional data, the features of market-basket data are known to be of high dimensionality, sparsity, and with massive outliers. Without explicitly considering the presence of the taxonomy, most prior efforts on clustering market-basket data can be viewed as dealing with items in the leaf level of the taxonomy tree. Clustering transactions across different levels of the taxonomy is of great importance for marketing strategies as well as for the result representation of the clustering techniques for market-basket data. In view of the features of market-basket data, we devise in this paper a novel measurement, called the category-based adherence, and utilize this measurement to perform the clustering. The distance of an item to a given cluster is defined as the number of links between this item and its nearest large node in the taxonomy tree where a large node is an item (i.e., leaf) or a category (i.e., internal) node whose occurrence count exceeds a given threshold. The category-based adherence of a transaction to a cluster is then defined as the average distance of the items in this transaction to that cluster. With this category-based adherence measurement, we develop an efficient clustering algorithm, called algorithm CBA (standing for Category-Based Adherence), for market-basket data with the objective to minimize the category-based adherence. A validation model based on Information Gain (IG) is also devised to assess the quality of clustering for market-basket data. As validated by both real and synthetic datasets, it is shown by our experimental results, with the taxonomy information, algorithm CBA devised in this paper significantly outperforms the prior works in both the execution efficiency and the clustering quality for market-basket data.
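
The two definitions in the abstract (item-to-cluster distance and category-based adherence) can be illustrated with a minimal Python sketch. This is not the authors' implementation of algorithm CBA. It assumes the taxonomy is given as a child-to-parent mapping, that the "nearest large node" is found by walking upward from the item toward the root, and that an item with no large node on that path gets a penalty of one link more than its path to the root. The names `item_distance`, `adherence`, and `nearest_cluster`, and the toy taxonomy, are hypothetical.

```python
from typing import Dict, Iterable, Set


def item_distance(item: str, parent: Dict[str, str], large_nodes: Set[str]) -> int:
    """Number of links from `item` up to its nearest large node.

    Assumption: the search only walks upward (item, parent, grandparent, ...).
    Assumption: if no large node is found, return (links to root) + 1.
    """
    links = 0
    node = item
    while node is not None:
        if node in large_nodes:
            return links
        node = parent.get(node)  # None once we step past the root
        links += 1
    return links


def adherence(transaction: Iterable[str], parent: Dict[str, str],
              large_nodes: Set[str]) -> float:
    """Category-based adherence: average item distance to one cluster."""
    items = list(transaction)
    if not items:
        raise ValueError("transaction must contain at least one item")
    return sum(item_distance(i, parent, large_nodes) for i in items) / len(items)


def nearest_cluster(transaction: Iterable[str], parent: Dict[str, str],
                    clusters: Dict[str, Set[str]]) -> str:
    """Pick the cluster whose large nodes give the smallest adherence."""
    items = list(transaction)
    return min(clusters, key=lambda c: adherence(items, parent, clusters[c]))


# Toy taxonomy (hypothetical): leaves point to categories, roots have no entry.
parent = {"cola": "soda", "lemonade": "soda", "soda": "drinks", "bread": "bakery"}

# Each cluster is represented here only by its set of large nodes, i.e. the
# nodes whose occurrence count in that cluster exceeds the chosen threshold.
clusters = {"A": {"cola", "soda"}, "B": {"bakery"}}

basket = ["cola", "lemonade"]
print(adherence(basket, parent, clusters["A"]))
print(nearest_cluster(basket, parent, clusters))
```

In this sketch a lower adherence means a closer fit, so assigning each transaction to the cluster with the smallest value matches the stated objective of minimizing the category-based adherence. How the large nodes are counted and updated during clustering is not described in the abstract and is left out here.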