An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data

Authors:
Liping Jing;Michael K. Ng;Joshua Zhexue Huang
Venue:
IEEE Transactions on Knowledge and Data Engineering
Year:
2007


Abstract

This paper presents a new k-means type algorithm for clustering high-dimensional objects in subspaces. In high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. For example, in text clustering, clusters of documents of different topics are categorized by different subsets of terms or keywords. The keywords for one cluster may not occur in the documents of other clusters. This is a data sparsity problem faced in clustering high-dimensional data. In the new algorithm, we extend the k-means clustering process to calculate a weight for each dimension in each cluster and use the weight values to identify the subsets of important dimensions that categorize different clusters. This is achieved by including the weight entropy in the objective function that is minimized in the k-means clustering process. An additional step is added to the k-means clustering process to automatically compute the weights of all dimensions in each cluster. The experiments on both synthetic and real data have shown that the new algorithm can generate better clustering results than other subspace clustering algorithms. The new algorithm is also scalable to large data sets.
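
The three-step loop described in the abstract (assign objects, update centers, then compute per-cluster dimension weights) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. The abstract does not state the update formulas, so the weight step here is an assumption: it uses the softmax-style update that results from minimizing weighted within-cluster dispersion plus a `gamma`-scaled weight-entropy term, with each cluster's weights constrained to sum to one. The function name `ewkm` and the parameters `gamma`, `n_iter`, and `seed` are hypothetical.

```python
import math
import random


def ewkm(points, k, gamma=1.0, n_iter=20, seed=0):
    """Illustrative entropy-weighted k-means loop (assumed update rules)."""
    rng = random.Random(seed)
    n_dims = len(points[0])
    # Initialise centers from k randomly chosen points.
    centers = [list(p) for p in rng.sample(points, k)]
    # Start with uniform dimension weights in every cluster.
    weights = [[1.0 / n_dims] * n_dims for _ in range(k)]
    labels = [0] * len(points)

    for _ in range(n_iter):
        # Step 1: assign each point to the cluster with the
        # smallest weighted squared distance.
        for idx, p in enumerate(points):
            dists = [
                sum(w * (c - x) ** 2
                    for w, c, x in zip(weights[l], centers[l], p))
                for l in range(k)
            ]
            labels[idx] = dists.index(min(dists))

        for l in range(k):
            members = [p for p, lab in zip(points, labels) if lab == l]

            # Step 2: move the center to the mean of its members
            # (an empty cluster keeps its previous center).
            if members:
                centers[l] = [sum(col) / len(members)
                              for col in zip(*members)]

            # Step 3 (the added step): per-dimension dispersion
            # inside this cluster, then a softmax over -dispersion/gamma.
            disp = [
                sum((centers[l][i] - p[i]) ** 2 for p in members)
                for i in range(n_dims)
            ]
            d_min = min(disp)  # shift by the minimum for numerical stability
            expo = [math.exp(-(d - d_min) / gamma) for d in disp]
            total = sum(expo)
            weights[l] = [e / total for e in expo]

    return labels, centers, weights
```

Under these assumptions, a dimension along which a cluster is tight receives a large weight, and a dimension along which it is spread out receives a small one. A smaller `gamma` concentrates the weight on fewer dimensions, while a larger `gamma` pushes the weights toward uniform. The sketch uses dense Python lists for readability and makes no attempt at the scalability on sparse data that the abstract claims for the published algorithm.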