A robust and scalable clustering algorithm for mixed type attributes in large database environment

Authors:
Tom Chiu;DongPing Fang;John Chen;Yao Wang;Christopher Jeris
Affiliations:
American Century Investments, Kansas City, MO;SPSS Inc., Chicago, IL;SPSS Inc., Chicago, IL;SPSS Inc., Chicago, IL;SPSS Inc., Chicago, IL
Venue:
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2001

Citing 3
Cited 30

CACTUS—clustering categorical data using summaries

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
BIRCH: A New Data Clustering Algorithm and Its Applications

Data Mining and Knowledge Discovery
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values

Data Mining and Knowledge Discovery

Scalable Model-based Clustering by Working on Data Summaries

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Hypergraph Models and Algorithms for Data-Pattern-Based Clustering

Data Mining and Knowledge Discovery
Knowledge discovery by probabilistic clustering of distributed databases

Data & Knowledge Engineering
Learning States and Rules for Detecting Anomalies in Time Series

Applied Intelligence
What do we know about mobile Internet adopters? A cluster analysis

Information and Management
Lessons learned from i-mode: What makes consumers click wireless banner ads?

Computers in Human Behavior
Ex-ray: Data mining and mental health

Applied Soft Computing
Hierarchical clustering of mixed data based on distance hierarchy

Information Sciences: an International Journal
Definition of MV load diagrams via weighted evidence accumulation clustering using subsampling

ISPRA'07 Proceedings of the 6th WSEAS International Conference on Signal Processing, Robotics and Automation
Network anomaly detection based on semi-supervised clustering

SMO'07 Proceedings of the 7th WSEAS International Conference on Simulation, Modelling and Optimization
Exploring the relationship between software project duration and risk exposure: A cluster analysis

Information and Management
Comparing the performance of traditional cluster analysis, self-organizing maps and fuzzy C-means method for strategic grouping

Expert Systems with Applications: An International Journal
Profiling Retail Web Site Functionalities and Conversion Rates: A Cluster Analysis

International Journal of Electronic Commerce
Finding approximate solutions to combinatorial problems with very large data sets using BIRCH

Computational Statistics & Data Analysis
A machine learning-based approach to prognostic analysis of thoracic transplantations

Artificial Intelligence in Medicine
Enhanced k-means clustering for patient reported outcome

CEA'10 Proceedings of the 4th WSEAS international conference on Computer engineering and applications
An approach to quantitatively measuring collaborative performance in online conversations

Computers in Human Behavior
Clustering scientific literature using sparse citation graph analysis

PKDD'06 Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases
Clustering mixed type attributes in large dataset

ISPA'05 Proceedings of the Third international conference on Parallel and Distributed Processing and Applications
A dissimilarity measure for the k-Modes clustering algorithm

Knowledge-Based Systems
Applying the Mahalanobis-Taguchi strategy for software defect diagnosis

Automated Software Engineering
An optimal cluster-based approach for Subgroup Analysis using Information Complexity Criterion

International Journal of Business Intelligence and Data Mining
Graphical method to find optimal cluster centroid for two-variable linear functions of concept-drift categorical data

Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology
Measuring Collective Cognition in Online Collaboration Venues

International Journal of e-Collaboration
A framework for strategy formulation based on clustering approach: A case study in a corporate organization

Knowledge-Based Systems
Competitive positioning and performance assessment in the construction industry

Expert Systems with Applications: An International Journal
Data integration techniques for the measurement of the reliability of sample variables

International Journal of Business Intelligence and Data Mining
Estimating the predominant number of clusters in a dataset

Intelligent Data Analysis

Abstract

Clustering is widely used in data mining applications to discover patterns in the underlying data. Most traditional clustering algorithms are limited to datasets that contain either continuous or categorical attributes, yet datasets with mixed attribute types are common in real-life data mining problems. In this paper, we propose a distance measure that enables clustering of data with both continuous and categorical attributes. The measure is derived from a probabilistic model in which the distance between two clusters is equivalent to the decrease in the log-likelihood function that results from merging them. Calculating this measure is memory efficient because it depends only on the pair of clusters being merged, not on all the other clusters. Zhang et al. [8] proposed a clustering method named BIRCH that is especially suitable for very large datasets. We develop a clustering algorithm that uses our distance measure within the BIRCH framework. Like BIRCH, our algorithm first performs a pre-clustering step that scans the entire dataset and stores the dense regions of data records as summary statistics. A hierarchical clustering algorithm is then applied to cluster the dense regions. Apart from its ability to handle mixed-type attributes, our algorithm differs from BIRCH in two ways: it adds a procedure that automatically determines the appropriate number of clusters, and it uses a new strategy for assigning cluster membership to noisy data. For data with mixed-type attributes, our experimental results confirm that the algorithm not only generates better-quality clusters than traditional k-means algorithms, but also scales well and correctly identifies the underlying number of clusters in the data. The algorithm is implemented in the commercial data mining tool Clementine 6.0, which supports the PMML standard for data mining model deployment.
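
The idea of a distance defined as the decrease in log-likelihood caused by a merge can be illustrated with a small sketch. The abstract does not state the formula, so the code below rests on assumptions that are not taken from the paper: continuous attributes are modeled as independent Gaussians, categorical attributes as independent multinomials, constant terms are dropped, and a small variance floor (`var_floor`) keeps the logarithm finite. The names `ClusterSummary`, `summarize`, `merge`, `log_likelihood`, and `distance` are hypothetical. What the sketch is meant to show is that each cluster can be reduced to additive summary statistics, so the distance between two clusters needs only the summaries of that pair.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class ClusterSummary:
    """Summary statistics for one cluster (hypothetical structure)."""
    n: int          # number of records
    sums: list      # sum of values, one entry per continuous attribute
    sq_sums: list   # sum of squared values, one entry per continuous attribute
    counts: list    # one Counter of category frequencies per categorical attribute


def summarize(records):
    """Build a summary from records of the form (continuous_values, categorical_values)."""
    cont0, cat0 = records[0]
    s = ClusterSummary(0, [0.0] * len(cont0), [0.0] * len(cont0),
                       [Counter() for _ in cat0])
    for cont, cat in records:
        s.n += 1
        for k, x in enumerate(cont):
            s.sums[k] += x
            s.sq_sums[k] += x * x
        for k, c in enumerate(cat):
            s.counts[k][c] += 1
    return s


def merge(a, b):
    """Summaries are additive, so merging never revisits the raw records."""
    return ClusterSummary(
        a.n + b.n,
        [x + y for x, y in zip(a.sums, b.sums)],
        [x + y for x, y in zip(a.sq_sums, b.sq_sums)],
        [ca + cb for ca, cb in zip(a.counts, b.counts)],
    )


def log_likelihood(s, var_floor=1e-6):
    """Maximized log-likelihood of a cluster, up to constants.

    Assumption (not from the paper): independent Gaussian and multinomial
    attributes; var_floor is an illustrative guard against log(0).
    """
    total = 0.0
    for sx, sxx in zip(s.sums, s.sq_sums):
        mean = sx / s.n
        var = max(sxx / s.n - mean * mean, 0.0)
        total += 0.5 * math.log(var + var_floor)
    for counter in s.counts:
        # Entropy of the observed category proportions.
        total -= sum((c / s.n) * math.log(c / s.n) for c in counter.values())
    return -s.n * total


def distance(a, b):
    """Decrease in log-likelihood when clusters a and b are merged."""
    return log_likelihood(a) + log_likelihood(b) - log_likelihood(merge(a, b))
```

Under these assumptions, two clusters with similar values and categories yield a small distance, while clusters that differ in either kind of attribute yield a larger one, because the merged cluster needs a larger variance or a higher-entropy category distribution to describe the combined records.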