PG-Skip: proximity graph based clustering of long strings

  • Authors: Michail Kazimianec; Nikolaus Augsten

  • Affiliations: Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen, Italy (both authors)

  • Venue: DASFAA'11: Proceedings of the 16th International Conference on Database Systems for Advanced Applications, Part II
  • Year: 2011

Abstract

String data is omnipresent and appears in a wide range of applications. Often string data must be partitioned into clusters of similar strings, for example, for cleansing noisy data. A promising string clustering approach is the recently proposed Graph Proximity Cleansing (GPC). A distinguishing feature of GPC is that it automatically detects the cluster borders without knowledge about the underlying data, using the so-called proximity graph. Unfortunately, the computation of the proximity graph is expensive. In particular, the runtime is high for long strings, thus limiting the application of the state-of-the-art GPC algorithm to short strings. In this work we present two algorithms, PG-Skip and PG-Binary, that efficiently compute the GPC cluster borders and scale to long strings. PG-Skip follows a prefix pruning strategy and does not need to compute the full proximity graph to detect the cluster border. PG-Skip is much faster than the state-of-the-art algorithm, especially for long strings, and computes the exact GPC borders. We show the optimality of PG-Skip among all prefix pruning algorithms. PG-Binary is an efficient approximation algorithm, which uses a binary search strategy to detect the cluster border. Our extensive experiments on synthetic and real-world data confirm the scalability of PG-Skip and show that PG-Binary approximates the GPC clusters very effectively.