Improving k-means by outlier removal

Authors:
Ville Hautamäki; Svetlana Cherednichenko; Ismo Kärkkäinen; Tomi Kinnunen; Pasi Fränti
Affiliations:
Speech and Image Processing Unit, Department of Computer Science, University of Joensuu, Joensuu, Finland (all authors)
Venue:
SCIA'05: Proceedings of the 14th Scandinavian Conference on Image Analysis
Year:
2005

Abstract

We present an Outlier Removal Clustering (ORC) algorithm that performs outlier detection and data clustering simultaneously. The method uses both clustering and outlier discovery to improve the estimate of the centroids of the generative distribution. The algorithm consists of two stages: the first is a pure K-means process, and the second iteratively removes the vectors that are far from their cluster centroids. We provide experimental results on three synthetic datasets and three map images that were corrupted by lossy compression. The results indicate that the proposed method has a lower error than the competing methods on datasets with overlapping clusters.
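
The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how "far from their cluster centroids" is measured, so the removal rule used here (distance to the assigned centroid, normalized by the current maximum distance and compared against a `threshold`) is an assumption of this sketch. The function names, the `rounds` parameter, and the random initialization are likewise hypothetical and not taken from the paper.

```python
import numpy as np


def kmeans(X, centroids, n_iter=20):
    """Plain Lloyd iterations starting from the given centroids."""
    for _ in range(n_iter):
        # Assign every vector to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members;
        # a cluster that lost all members keeps its old centroid.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
    # Final assignment against the updated centroids.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)


def outlier_removal_clustering(X, k, threshold=0.9, rounds=5, seed=0):
    """Sketch of a two-stage scheme: K-means, then iterative outlier removal.

    ASSUMPTION: a vector is treated as an outlier when its distance to its
    centroid, divided by the largest such distance, exceeds `threshold`.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    # Stage 1: pure K-means.
    centroids, labels = kmeans(X, centroids)

    # Stage 2: repeatedly drop far-away vectors and re-estimate centroids.
    for _ in range(rounds):
        d = np.linalg.norm(X - centroids[labels], axis=1)
        d_max = d.max()
        if d_max == 0:
            break  # every vector coincides with its centroid
        keep = d / d_max <= threshold
        if not keep.any():
            break  # never discard the whole dataset
        X = X[keep]
        centroids, labels = kmeans(X, centroids)

    return centroids, X
```

With a `threshold` below 1, each round removes at least the single farthest vector, so `rounds` and `threshold` together control how aggressively the data are pruned. Suitable values depend on the data, and none are given in the abstract.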