Innovation in the cluster validating techniques

Authors:
Ravi Jain;Andy Koronios
Affiliations:
School of Computer and Information Sciences, University of South Australia, Adelaide, Australia;School of Computer and Information Sciences, University of South Australia, Adelaide, Australia
Venue:
Fuzzy Optimization and Decision Making
Year:
2008

Abstract

Detecting database records that are exact or approximate duplicates, whether they arise from data entry errors, from differences in the detailed schemas of records drawn from multiple databases, or from other causes, is an important line of research. Yet no comprehensive comparative study has evaluated the effectiveness of the Silhouette width, the Calinski & Harabasz index (pseudo F-statistic) and the Baker & Hubert index (γ index) in the presence of exact and approximate duplicates. This paper presents a comparative study of these three cluster validation techniques, which measure the stability of a partition of a data set in the presence of noise, in particular approximate and exact duplicates. The Silhouette width, Calinski & Harabasz index and Baker & Hubert index are calculated before and after exact and approximate duplicates are deliberately inserted into the data set. Comprehensive experiments on the glass, wine, iris and ruspini data sets confirm that the Baker & Hubert index is not stable in the presence of approximate duplicates. Moreover, the Silhouette width, Calinski & Harabasz index and Baker & Hubert index values do not exceed those of the original data in the presence of approximate duplicates.
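
The before-and-after comparison described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It uses the textbook definitions of the three indices, not necessarily the authors' implementation. The helper `add_duplicates`, the toy data, the Gaussian jitter used for approximate duplicates, and the choice to let each duplicate inherit the cluster label of its source record are all assumptions made for illustration. The code also assumes at least two clusters.

```python
import math
import random
from itertools import combinations


def silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def calinski_harabasz(points, labels):
    """Pseudo F-statistic: (B / (k - 1)) / (W / (n - k))."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    centre = [sum(p[d] for p in points) / n for d in range(dim)]
    between = within = 0.0
    for members in clusters.values():
        c = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * math.dist(c, centre) ** 2
        within += sum(math.dist(p, c) ** 2 for p in members)
    return (between / (k - 1)) / (within / (n - k))


def baker_hubert_gamma(points, labels):
    """Gamma = (s+ - s-) / (s+ + s-), comparing within- and between-cluster distances."""
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    s_plus = sum(1 for w in within for b in between if w < b)   # concordant
    s_minus = sum(1 for w in within for b in between if w > b)  # discordant
    return (s_plus - s_minus) / (s_plus + s_minus)


def add_duplicates(points, labels, count, noise=0.0, seed=0):
    """Hypothetical helper: append `count` copies of randomly chosen records.

    noise == 0 gives exact duplicates; noise > 0 adds Gaussian jitter to
    mimic approximate duplicates. Assumption: each duplicate keeps the
    cluster label of the record it was copied from.
    """
    rng = random.Random(seed)
    new_points, new_labels = list(points), list(labels)
    for _ in range(count):
        i = rng.randrange(len(points))
        new_points.append(tuple(x + rng.gauss(0.0, noise) for x in points[i]))
        new_labels.append(labels[i])
    return new_points, new_labels


if __name__ == "__main__":
    # Toy data: two well-separated square clusters (illustrative only).
    pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    labs = [0, 0, 0, 0, 1, 1, 1, 1]
    variants = {
        "original": (pts, labs),
        "exact duplicates": add_duplicates(pts, labs, 3),
        "approximate duplicates": add_duplicates(pts, labs, 3, noise=0.05),
    }
    for name, (p, l) in variants.items():
        print(name, silhouette_width(p, l), calinski_harabasz(p, l),
              baker_hubert_gamma(p, l))
```

Comparing the three printed rows shows how each index shifts once duplicates are present. The paper's experiments use the glass, wine, iris and ruspini data sets rather than toy data like this.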