Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values

Authors:
Zhexue Huang
Affiliations:
ACSys CRC, CSIRO Mathematical and Information Sciences, GPO Box 664, Canberra, ACT 2601, Australia. huang@mip.com.au
Venue:
Data Mining and Knowledge Discovery
Year:
1998


Abstract

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
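
The ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the batch-style update loop, the caller-supplied initial modes, and the weight parameter gamma in the mixed measure are all assumptions made for illustration. It shows a simple matching dissimilarity (the number of attributes on which two objects differ), a frequency-based mode (the most frequent category per attribute), and one plausible form of a combined measure for mixed data (squared Euclidean distance on the numeric part plus a weighted mismatch count on the categorical part).

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: number of attributes on which two objects differ."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(cluster):
    """Frequency-based mode: the most frequent category in each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))


def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Assumed combined measure: squared Euclidean distance on the numeric
    attributes plus gamma times the mismatch count on the categorical ones."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    return numeric + gamma * matching_dissimilarity(cat_a, cat_b)


def k_modes(objects, initial_modes, max_iter=100):
    """Batch-style k-modes sketch: assign every object to its nearest mode,
    recompute the modes, and stop when the assignments no longer change."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(obj, modes[j]))
            for obj in objects
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(len(modes)):
            members = [o for o, lab in zip(objects, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = mode_of(members)
    return labels, modes


# Hypothetical toy data: six objects described by three categorical attributes.
objects = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "y", "p"),
    ("b", "z", "r"), ("b", "z", "s"), ("c", "z", "r"),
]
labels, modes = k_modes(objects, initial_modes=[objects[0], objects[3]])
```

On this toy input the first three objects fall into one cluster and the last three into the other. Like k-means, a procedure of this kind depends on the initial modes and may settle in a local minimum of the cost function, so in practice several initialisations would usually be compared.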