A novel attribute weighting algorithm for clustering high-dimensional categorical data

Authors:
Liang Bai;Jiye Liang;Chuangyin Dang;Fuyuan Cao
Affiliations:
Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, ...;Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, ...;Department of Manufacturing Engineering and Engineering Management, City University of Hong Kong, Hong Kong;Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, School of Computer and Information Technology, Shanxi University, Taiyuan, 030006 Shanxi, ...
Venue:
Pattern Recognition
Year:
2011

Abstract

Due to data sparseness and attribute redundancy in high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. To address this issue, this paper presents a new optimization algorithm for clustering high-dimensional categorical data, which extends the k-modes clustering algorithm. In the proposed algorithm, a novel weighting technique for categorical data calculates two weights for each attribute (or dimension) in each cluster and uses the weight values to identify the subsets of important attributes that categorize different clusters. The convergence of the algorithm under an optimization framework is proved. The performance and scalability of the algorithm are evaluated experimentally on both synthetic and real data sets. The experimental studies show that the proposed algorithm is effective in clustering categorical data sets and is also scalable to large data sets owing to its linear time complexity with respect to the number of data objects, attributes, or clusters.
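
To make the setting concrete, the following is a minimal Python sketch of the plain k-modes baseline that the abstract says the algorithm extends, using simple matching dissimilarity and mode updates. It is not the paper's method: the abstract does not give the weight formulas, so the `mode_agreement` score below (the share of cluster members that match the cluster mode on each attribute) is only an assumed stand-in for illustrating how per-cluster, per-attribute values can point to the attributes that define a subspace. All function names, the toy data, and the fixed initial modes are hypothetical.

```python
from collections import Counter


def matching_distance(x, mode):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(a != b for a, b in zip(x, mode))


def k_modes(data, init_modes, max_iter=100):
    """Plain k-modes: alternate assignment and mode update until labels stop changing."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda c: matching_distance(x, modes[c]))
            for x in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: the new mode takes the most frequent value per attribute.
        for c in range(len(modes)):
            members = [x for x, l in zip(data, labels) if l == c]
            if members:  # keep the old mode if a cluster is empty
                modes[c] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return labels, modes


def mode_agreement(data, labels, modes):
    """Assumed illustrative score (not the paper's weights):
    share of cluster members matching the mode on each attribute."""
    scores = []
    for c, mode in enumerate(modes):
        members = [x for x, l in zip(data, labels) if l == c]
        scores.append([
            sum(x[j] == mode[j] for x in members) / len(members) if members else 0.0
            for j in range(len(mode))
        ])
    return scores


# Toy data: attributes 0 and 1 separate the two groups, attribute 2 is noise.
data = [
    ("a", "x", "p"), ("a", "x", "q"), ("a", "x", "r"),
    ("b", "y", "p"), ("b", "y", "q"), ("b", "y", "r"),
]
labels, modes = k_modes(data, init_modes=[data[0], data[3]])
scores = mode_agreement(data, labels, modes)
```

On this toy data the first two attributes agree with the mode for every member of each cluster, while the third agrees for only one member in three, so a per-cluster score of this kind separates the informative attributes from the noisy one. Each iteration of the sketch touches every object, attribute, and cluster once, which is in line with the linear per-iteration cost the abstract mentions.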