Clustering with Proximity Graphs: Exact and Efficient Algorithms

  • Authors:
  • Michail Kazimianec; Nikolaus Augsten

  • Affiliations:
  • Faculty of Economics, Vilnius University, Vilnius, Lithuania; Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen-Bolzano, Italy

  • Venue:
  • International Journal of Knowledge-Based Organizations
  • Year:
  • 2013

Abstract

Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster, a so-called proximity graph is computed, and the cluster border is detected based on this graph. However, computing the proximity graph is expensive, and the state-of-the-art GPC algorithms only approximate it using a sampling technique. Further, the quality of GPC clusters has never been compared to that of standard clustering techniques such as k-means, density-based, or hierarchical clustering. In this article, the authors propose two efficient algorithms, PG-DS and PG-SM, for the exact computation of proximity graphs. The authors show experimentally that these solutions are faster than the sampling-based algorithms even when the latter use very small sample sizes. The authors also provide a thorough experimental evaluation of GPC and conclude that it is very efficient and shows good clustering quality in comparison to the standard techniques. These results open a new perspective on string clustering in settings where no knowledge about the input data is available.