Feature-based cluster validation for high-dimensional data

Authors:
Randa Kassab;Jean-Charles Lamirel
Affiliations:
LORIA - INRIA Lorraine, Cedex, France;LORIA - INRIA Lorraine, Cedex, France
Venue:
AIA '08 Proceedings of the 26th IASTED International Conference on Artificial Intelligence and Applications
Year:
2008

Abstract

Cluster validation is commonly used to determine the optimal number of clusters in a data set. Despite the success of distance-based validity indexes, their efficacy decreases rapidly on high-dimensional data. This paper introduces a feature-based cluster validation criterion that can cope with this situation. In contrast to distance-based methods, our criterion evaluates similarity in terms of the relevant features shared between data items. The idea is based on identifying the "core" features, which are correlated within the description of each discovered cluster. The individual quality of each cluster is then evaluated through the frequency of its core features relative to that of its non-core features, while between-cluster isolation is measured by the overlap coefficient between clusters, considering only their core features. The overall clustering quality is measured by a weighted combination of the within-cluster and between-cluster correlation coefficients, which makes it possible to choose an appropriate number of clusters according to the purpose of the clustering. Furthermore, our validation can prune unreliable clusters that have no correlated features and thus no specific description of their content. Extensive experiments on the Reuters-21578 collection show the effectiveness of our validation criterion.
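The abstract gives no formulas, so the following Python sketch is only one possible reading of the pipeline it describes, not the authors' method. It assumes each data item is a set of features. The core-feature rule (a feature is core if it occurs in at least `min_share` of a cluster's items), the within-cluster measure, the weight `alpha`, and all function names are illustrative assumptions. Only the overlap coefficient uses its standard definition, |A ∩ B| / min(|A|, |B|).

```python
from itertools import combinations


def core_features(cluster, min_share=0.6):
    """ASSUMED rule: a feature is 'core' if it occurs in at least
    min_share of the cluster's items. The paper's rule may differ."""
    counts = {}
    for item in cluster:
        for feature in item:
            counts[feature] = counts.get(feature, 0) + 1
    return {f for f, c in counts.items() if c / len(cluster) >= min_share}


def within_quality(cluster, core):
    """ASSUMED measure: share of all feature occurrences in the cluster
    that belong to core features (core vs. non-core frequency)."""
    total = sum(len(item) for item in cluster)
    hits = sum(len(item & core) for item in cluster)
    return hits / total if total else 0.0


def overlap_coefficient(a, b):
    """Standard overlap coefficient between two sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def clustering_score(clusters, alpha=0.5, min_share=0.6):
    """Weighted combination of within-cluster quality and between-cluster
    isolation. alpha is a hypothetical weight, not taken from the paper."""
    cores = [core_features(c, min_share) for c in clusters]
    # Prune clusters that have no core features at all.
    kept = [(c, k) for c, k in zip(clusters, cores) if k]
    if not kept:
        return 0.0
    within = sum(within_quality(c, k) for c, k in kept) / len(kept)
    pairs = list(combinations([k for _, k in kept], 2))
    overlap = (
        sum(overlap_coefficient(a, b) for a, b in pairs) / len(pairs)
        if pairs
        else 0.0
    )
    # High isolation means low overlap between the clusters' core features.
    return alpha * within + (1 - alpha) * (1 - overlap)
```

Under these assumptions, one would compute `clustering_score` for each candidate number of clusters and compare the results, with `alpha` shifting the emphasis between compact and well-separated clusters.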