We propose a method that efficiently clusters high-dimensional data. The method builds on random projection and the K-means algorithm: K-means is applied several times, and the dimensionality of the projected data is increased after each convergence of K-means. We compare the proposed algorithm on four high-dimensional datasets (image, text, and two synthetic) against K-means clustering with a single random projection and against K-means clustering of the original high-dimensional data. In terms of running time, the algorithm is drastically faster than K-means on the original high-dimensional data. In terms of mean squared error, the proposed method reaches a better solution than clustering with a single random projection; more notably, in the experiments performed it also reaches a better solution than clustering the original high-dimensional data.
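The scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian random projection, a farthest-point initialization for the first stage, and a warm start in which each stage's initial centers are the means of the previous stage's clusters recomputed in the newly projected space. The dimension schedule `dims` and the helper names are illustrative choices, not taken from the paper.

```python
import numpy as np

def farthest_init(X, k, rng):
    # Maxmin initialization: first center random, each next center is the
    # point farthest from all centers chosen so far.
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        dist = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[dist.argmax()])
    return np.array(centers, dtype=float)

def kmeans(X, k, centers=None, iters=50, rng=None):
    # Plain Lloyd's algorithm; returns labels and final centers.
    rng = np.random.default_rng(0) if rng is None else rng
    if centers is None:
        centers = farthest_init(X, k, rng)
    centers = np.asarray(centers, dtype=float).copy()
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return labels, centers

def rp_kmeans(X, k, dims=(5, 20, 80), rng=None):
    # Run K-means on random projections of increasing dimension; the labels
    # found at each stage warm-start the next (hypothetical sketch).
    rng = np.random.default_rng(0) if rng is None else rng
    n, d = X.shape
    labels = None
    for m in dims:
        R = rng.standard_normal((d, m)) / np.sqrt(m)  # Gaussian projection
        Y = X @ R
        if labels is None:
            labels, _ = kmeans(Y, k, rng=rng)
        else:
            # Initial centers: previous clusters' means in the new space;
            # empty clusters fall back to a random point.
            centers = np.vstack([
                Y[labels == j].mean(axis=0) if np.any(labels == j)
                else Y[rng.integers(n)] for j in range(k)])
            labels, _ = kmeans(Y, k, centers=centers, rng=rng)
    return labels
```

Because each stage runs in a space of dimension `m` rather than `d`, the per-iteration cost of the early K-means passes is much lower than in the full space, which is the source of the running-time savings reported above.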