Improved graph-based metrics for clustering high-dimensional datasets

Authors:
Ariel E. Bayá;Pablo M. Granitto
Affiliations:
French Argentine International Center for Information and Systems Sciences, UPCAM, France and UNR, CONICET, Rosario, Argentina;French Argentine International Center for Information and Systems Sciences, UPCAM, France and UNR, CONICET, Rosario, Argentina
Venue:
IBERAMIA'10 Proceedings of the 12th Ibero-American conference on Advances in artificial intelligence
Year:
2010

Abstract

Clustering is one of the most widely used tools for data analysis. Unfortunately, most methods perform poorly in high-dimensional spaces. Recent work has shown evidence that graph-based metrics can mitigate this problem. In particular, the Penalized K-Nearest Neighbour Graph metric (PKNNG) has shown good results in several situations. In this work we propose two improvements to this metric that make it suitable for application to very different domains. First, we introduce an appropriate way to manage outliers, a typical problem in graph-based metrics. Second, we propose a simple method to select an optimal value of K, the number of neighbours considered in the k-nn graph. We analyze the proposed modifications using both artificial and real data, finding strong evidence that supports our improvements. We then compare our new method to other graph-based metrics, showing that it achieves good performance on high-dimensional datasets from very different domains, including DNA microarrays and face and digit image recognition problems.
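To make the idea of a graph-based metric concrete, the following is a minimal sketch of a generic k-nn graph distance: each point is linked to its K nearest neighbours and the distance between two points is the length of the shortest path through the graph. It is not the PKNNG metric itself, whose penalization, outlier handling and K selection are not specified in this abstract. The function names `knn_graph` and `graph_distances` are hypothetical and chosen only for illustration.

```python
import heapq
import math


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbour graph as an adjacency dict.

    Illustrative helper (an assumption, not taken from the paper).
    """
    n = len(points)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        # Euclidean distance from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            # Store the edge in both directions so the graph is undirected.
            graph[i][j] = d
            graph[j][i] = d
    return graph


def graph_distances(graph, source):
    """Shortest-path distances from `source` to every node (Dijkstra)."""
    best = {node: math.inf for node in graph}
    best[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph[u].items():
            if d + w < best[v]:
                best[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return best


# Four evenly spaced points: with k=1 the graph is a single chain.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
chain_dist = graph_distances(knn_graph(chain, 1), 0)

# Two well-separated pairs: with k=1 the graph splits into two components,
# so points in the other component are unreachable (infinite distance).
split = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
split_dist = graph_distances(knn_graph(split, 1), 0)
```

In the second example the plain graph distance between the two groups is infinite, which shows why the choice of K and the treatment of disconnected or outlying points matter for this family of metrics.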