In this paper, we devise an efficient algorithm for clustering market-basket data. Unlike traditional data, market-basket data are known to be high-dimensional, sparse, and to contain massive outliers. Because they do not explicitly consider the taxonomy, most prior efforts on clustering market-basket data can be viewed as dealing only with items at the leaf level of the taxonomy tree. Clustering transactions across different levels of the taxonomy is of great importance for marketing strategies as well as for representing the results of clustering techniques for market-basket data. In view of these features, we devise a novel measurement, called the category-based adherence, and use it to perform the clustering. The distance of an item to a given cluster is defined as the number of links between this item and its nearest large node in the taxonomy tree, where a large node is an item (i.e., leaf) node or a category (i.e., internal) node whose occurrence count exceeds a given threshold. The category-based adherence of a transaction to a cluster is then defined as the average distance of the items in this transaction to that cluster. With this measurement, we develop an efficient clustering algorithm, called algorithm CBA (Category-Based Adherence), for market-basket data, with the objective of minimizing the category-based adherence. A validation model based on Information Gain (IG) is also devised to assess the quality of clustering for market-basket data. Experimental results on both real and synthetic datasets show that, with the taxonomy information, algorithm CBA significantly outperforms prior works in both execution efficiency and clustering quality for market-basket data.
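The two definitions above (item distance and category-based adherence) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the taxonomy `PARENT` and all item names are hypothetical, a category is counted once per transaction that contains any of its descendant items, the "nearest large node" is searched only among the item itself and its ancestors, the threshold is an absolute count, and an item with no large ancestor is charged one link past the root.

```python
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root "food" has no entry).
PARENT = {
    "cola": "drinks",
    "juice": "drinks",
    "chips": "snacks",
    "nuts": "snacks",
    "drinks": "food",
    "snacks": "food",
}


def ancestors_and_self(node):
    """Yield node, then its parent, grandparent, ... up to the root."""
    while node is not None:
        yield node
        node = PARENT.get(node)


def occurrence_counts(cluster):
    """Count, per taxonomy node, the transactions in the cluster that touch it.

    Assumption: a category occurs in a transaction when any of its
    descendant items does.
    """
    counts = Counter()
    for transaction in cluster:
        touched = set()
        for item in transaction:
            touched.update(ancestors_and_self(item))
        counts.update(touched)
    return counts


def item_distance(item, counts, threshold):
    """Links from item up to its nearest large node (count > threshold)."""
    links = 0
    for node in ancestors_and_self(item):
        if counts[node] > threshold:
            return links
        links += 1
    # Assumption: no large node on the path, so charge one link past the root.
    return links


def adherence(transaction, counts, threshold):
    """Average item distance of a non-empty transaction to the cluster."""
    total = sum(item_distance(item, counts, threshold) for item in transaction)
    return total / len(transaction)


# Toy cluster of three transactions.
cluster = [{"cola", "chips"}, {"cola", "nuts"}, {"juice", "cola"}]
counts = occurrence_counts(cluster)

# With threshold 1, "cola" is large itself and "nuts" reaches "snacks" in one link.
print(adherence({"cola", "nuts"}, counts, threshold=1))   # 0.5
print(adherence({"juice", "chips"}, counts, threshold=1))  # 1.0
```

Under these assumptions, a lower adherence value means the transaction sits closer to the cluster's large nodes, so a clustering step aiming to minimize adherence would plausibly assign each transaction to the cluster where this value is smallest.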