Squeezer: an efficient algorithm for clustering categorical data

Authors:
He Zengyou; Xu Xiaofei; Deng Shengchun
Affiliations:
Department of Computer Science and Engineering, Harbin Institute of Technology, Harbin 150001, P.R. China (all authors)
Venue:
Journal of Computer Science and Technology
Year:
2002


Abstract

This paper presents Squeezer, a new efficient algorithm for clustering categorical data that produces high-quality clustering results while scaling well. Squeezer reads each tuple t in sequence and either assigns t to an existing cluster (initially there are none) or creates a new cluster containing t; the choice is determined by the similarity between t and each existing cluster. Because it processes the data in a single sequential pass, the algorithm is well suited to clustering data streams, where, given a sequence of points, the objective is to maintain a consistently good clustering of the sequence seen so far using a small amount of memory and time. Squeezer can also handle outliers efficiently and directly. Experimental results on real-life and synthetic datasets verify the superiority of Squeezer.
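
The one-pass procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the similarity measure or how the assign-or-create decision is made, so several parts here are assumptions rather than the authors' method: the frequency-based `similarity` function, the `threshold` parameter, and the names `squeezer_sketch`, `summaries`, and `sizes` are all hypothetical.

```python
from collections import Counter


def similarity(summary, size, tup):
    # Assumed measure (not taken from the abstract): for each attribute,
    # the fraction of the cluster's tuples that share this tuple's value,
    # summed over all attributes.
    return sum(counts[value] / size for counts, value in zip(summary, tup))


def squeezer_sketch(tuples, threshold):
    """One pass over the tuples; returns a cluster index for each tuple."""
    summaries = []  # per cluster: one Counter of value frequencies per attribute
    sizes = []      # per cluster: number of tuples assigned so far
    labels = []
    for tup in tuples:
        # Find the existing cluster most similar to the current tuple.
        best, best_sim = None, -1.0
        for idx, (summary, size) in enumerate(zip(summaries, sizes)):
            sim = similarity(summary, size, tup)
            if sim > best_sim:
                best, best_sim = idx, sim
        # No clusters yet, or none similar enough: start a new cluster.
        if best is None or best_sim < threshold:
            summaries.append([Counter() for _ in tup])
            sizes.append(0)
            best = len(summaries) - 1
        # Add the tuple to the chosen cluster's summary.
        for counts, value in zip(summaries[best], tup):
            counts[value] += 1
        sizes[best] += 1
        labels.append(best)
    return labels


data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
print(squeezer_sketch(data, threshold=1.0))  # prints [0, 0, 1, 1]
```

Only per-cluster value counts are kept in this sketch, not the tuples themselves, which is one way to keep memory small in a streaming setting. Raising the hypothetical `threshold` makes new clusters more likely; at a value above the number of attributes, every tuple starts its own cluster.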