Towards understanding hierarchical clustering: A data distribution perspective

Authors:
Junjie Wu; Hui Xiong; Jian Chen
Affiliations:
Information Systems Department, School of Economics and Management, Beihang University, Beijing 100083, China; Management Science and Information Systems Department, Rutgers Business School, Rutgers University, Newark, NJ 07102, USA; Research Center for Contemporary Management, Key Research Institute of Humanities and Social Sciences at Universities, School of Economics and Management, Tsinghua University, Beijing 100084, Chin ...
Venue:
Neurocomputing
Year:
2009

Abstract

Hierarchical clustering is an important category of clustering methods, and considerable research effort has focused on algorithm-level improvements of the hierarchical clustering process. In this paper, our goal is to provide a systematic understanding of hierarchical clustering from a data distribution perspective. Specifically, we investigate how the "true" cluster distribution affects clustering performance, and how hierarchical clustering schemes and validation measures relate to each other under different data distributions. To this end, we provide an organized study of these issues. One of our key findings is that hierarchical clustering tends to produce clusters with high variation in cluster sizes regardless of the "true" cluster distribution. Our results also show that the F-measure, an external clustering validation measure, is biased towards hierarchical clustering algorithms that tend to increase the variation in cluster sizes. In light of this, we propose F_norm, a normalized version of the F-measure, to address the cluster validation problem for hierarchical clustering. Experimental results show that F_norm is more suitable than the unnormalized F-measure for evaluating hierarchical clustering results across data sets with different data distributions.
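
The two quantities the abstract relates, the variation in cluster sizes and the external F-measure, can be sketched on a toy example. This is a minimal illustration under stated assumptions, not the authors' code: the coefficient of variation is assumed here as the statistic for size variation (the abstract does not name one), the F-measure is the commonly used class-weighted form, and the helper names `size_cv` and `f_measure` are hypothetical. The normalized measure proposed in the paper is not defined in the abstract, so it is not reproduced.

```python
from collections import Counter
from statistics import mean, stdev


def size_cv(labels):
    """Coefficient of variation (sample std / mean) of group sizes.

    Assumption: CV is used as the measure of 'variation in cluster sizes'.
    """
    sizes = list(Counter(labels).values())
    if len(sizes) < 2:
        return 0.0  # a single group has no size variation
    return stdev(sizes) / mean(sizes)


def f_measure(classes, clusters):
    """Class-weighted F-measure of a clustering against true classes.

    For each true class, take the best F score over all clusters, then
    average those best scores weighted by class size.
    """
    n = len(classes)
    class_sizes = Counter(classes)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(classes, clusters))

    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck == 0:
                continue
            precision = n_ck / n_k
            recall = n_ck / n_c
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_c / n) * best
    return total


# Toy data: three balanced true classes of four objects each.
true_classes = [0] * 4 + [1] * 4 + [2] * 4
# A hypothetical skewed clustering: one large cluster and two singletons.
skewed = [0] * 10 + [1] + [2]

print("CV of true class sizes:", size_cv(true_classes))
print("CV of skewed cluster sizes:", size_cv(skewed))
print("F-measure of skewed clustering:", f_measure(true_classes, skewed))
```

Comparing the size variation of the true classes with that of the resulting clusters, alongside the F-measure, is the kind of check the abstract describes. A toy case like this only shows how the quantities are computed and says nothing about the bias itself.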