Partition clustering of high dimensional low sample size data based on p-values

Authors:
George von Borries; Haiyan Wang
Affiliations:
Departamento de Estatística, IE, Universidade de Brasília, 70910-900, DF, Brazil; Department of Statistics, Kansas State University, 66506-0802, KS, USA
Venue:
Computational Statistics & Data Analysis
Year:
2009

Abstract

Clustering techniques play an important role in analyzing the high dimensional data that are common in high-throughput screening, such as microarray and mass spectrometry data. Making effective use of the high dimensionality and of the available replications can increase clustering accuracy and stability. In this article, a new partitioning algorithm with a robust distance measure is introduced to cluster variables in high dimensional low sample size (HDLSS) data, which contain a large number of independent variables with a small number of replications per variable. The proposed clustering algorithm, PPCLUST, treats the data as coming from a mixture distribution and uses p-values from nonparametric rank tests of homogeneous distribution as a measure of similarity to separate the mixture components. PPCLUST can efficiently cluster a large number of variables in the presence of very few replications. Because it inherits the robustness of rank procedures, the new algorithm is robust to outliers and invariant to monotone transformations of the data. Numerical studies and an application to microarray gene expression data from a colorectal cancer study are discussed.
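
The idea of using a rank-test p-value as a similarity measure can be illustrated with a short sketch. It is not the PPCLUST algorithm itself, whose details the abstract does not give. It assumes a two-group Kruskal-Wallis statistic with a chi-square (1 degree of freedom) approximation, which is crude for so few replications and is used only to keep the example short. It also assumes a simple greedy pass over variables sorted by median. The names `midranks`, `rank_test_pvalue`, `greedy_partition`, the threshold `alpha`, and the toy data are all hypothetical.

```python
import math
from statistics import median


def midranks(values):
    """Return 1-based ranks (ties get the average rank) and tie-group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # sorted positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        tie_sizes.append(end - start + 1)
        start = end + 1
    return ranks, tie_sizes


def rank_test_pvalue(x, y):
    """Two-group Kruskal-Wallis p-value (assumed chi-square(1) approximation)."""
    pooled = list(x) + list(y)
    n1, n2, n = len(x), len(y), len(pooled)
    ranks, tie_sizes = midranks(pooled)
    r1, r2 = sum(ranks[:n1]), sum(ranks[n1:])
    h = 12.0 / (n * (n + 1)) * (r1 * r1 / n1 + r2 * r2 / n2) - 3 * (n + 1)
    tie_correction = 1.0 - sum(t**3 - t for t in tie_sizes) / (n**3 - n)
    if tie_correction <= 0:  # every observation is tied: no evidence of difference
        return 1.0
    h = max(h / tie_correction, 0.0)
    # Survival function of chi-square with 1 df: P(Z^2 > h) = erfc(sqrt(h / 2)).
    return math.erfc(math.sqrt(h / 2.0))


def greedy_partition(data, alpha=0.05):
    """Hypothetical greedy rule: a variable joins the current cluster when the
    p-value against that cluster's pooled replications exceeds alpha."""
    names = sorted(data, key=lambda name: median(data[name]))
    clusters = [[names[0]]]
    pooled = list(data[names[0]])
    for name in names[1:]:
        if rank_test_pvalue(pooled, data[name]) > alpha:
            clusters[-1].append(name)
            pooled.extend(data[name])
        else:
            clusters.append([name])
            pooled = list(data[name])
    return clusters


# Toy data: four variables with four replications each (made-up numbers).
data = {
    "g1": [1.0, 1.2, 0.9, 1.1],
    "g2": [1.05, 0.95, 1.15, 1.0],
    "g3": [5.0, 5.2, 4.9, 5.1],
    "g4": [5.1, 4.8, 5.3, 5.0],
}
clusters = greedy_partition(data)
print(clusters)

# Ranks do not change under a strictly increasing transformation,
# so the p-value (and hence the partition) is the same after exp().
p_raw = rank_test_pvalue(data["g1"], data["g3"])
p_exp = rank_test_pvalue([math.exp(v) for v in data["g1"]],
                         [math.exp(v) for v in data["g3"]])
print(p_raw == p_exp)
```

Because the similarity depends on the data only through ranks, a single extreme replication cannot move a rank beyond the end of the ordering. This is the sense in which a rank-based measure of this kind is robust to outliers and invariant to monotone transformations.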