Efficient layered density-based clustering of categorical data

Authors:
Bill Andreopoulos; Aijun An; Xiaogang Wang; Dirk Labudde
Affiliations:
Biotechnological Centre, Technische Universität Dresden, 47-51 Tatzberg, 01307 Dresden Sachsen, Germany and Dept. of Computer Science and Engineering, York University, Toronto, Canada; Dept. of Computer Science and Engineering, York University, Toronto, Canada; Dept. of Mathematics and Statistics, York University, Toronto, Canada; Biotechnological Centre, Technische Universität Dresden, 47-51 Tatzberg, 01307 Dresden Sachsen, Germany
Venue:
Journal of Biomedical Informatics
Year:
2009

Abstract

A challenge in applying density-based clustering to categorical biomedical data is that the "cube" of attribute values has no defined ordering, which makes the search for dense subspaces slow. We propose the HIERDENC algorithm for hierarchical density-based clustering of categorical data, together with a complementary index for searching for dense subspaces efficiently. The HIERDENC index is updated when new objects are introduced, so clustering does not need to be repeated on all objects; both updating and cluster retrieval are efficient. In comparisons with several other clustering algorithms on large datasets, HIERDENC achieved better runtime scalability with respect to the number of objects, as well as better cluster quality. By rapidly collapsing the bicliques in large networks, we achieved an edge reduction of as much as 86.5%. HIERDENC is suitable for large and quickly growing datasets, since it is independent of object ordering, does not require re-clustering when new data emerge, and requires no user-specified input parameters.
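
To make the notion of density in an unordered categorical "cube" concrete, here is a minimal sketch in Python. It is not the HIERDENC algorithm or its index; it only assumes that density around an object can be approximated by counting the objects within a given Hamming radius of it. The function names, the brute-force search, and the toy records are all hypothetical and chosen for illustration.

```python
from typing import Sequence, Tuple

Record = Tuple[str, ...]  # one categorical object: a tuple of attribute values


def hamming(a: Record, b: Record) -> int:
    """Number of attributes on which two categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def neighborhood_count(center: Record, data: Sequence[Record], radius: int) -> int:
    """Count the records lying within `radius` mismatches of `center`."""
    return sum(hamming(center, rec) <= radius for rec in data)


def densest_center(data: Sequence[Record], radius: int) -> Record:
    """Brute-force O(n^2) search for the record with the fullest neighborhood.

    Assumption: a real index would avoid this exhaustive scan; it is shown
    here only to illustrate what "dense subspace" can mean without ordering.
    """
    return max(data, key=lambda rec: neighborhood_count(rec, data, radius))


# Hypothetical toy data: three attributes per object.
records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("green", "tiny", "flat"),
]

if __name__ == "__main__":
    center = densest_center(records, radius=1)
    print(center, neighborhood_count(center, records, radius=1))
```

Because Hamming distance only asks whether two values match, the sketch needs no ordering on attribute values. The exhaustive scan also shows why an index for dense subspaces matters on large datasets.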