PG-Skip: proximity graph based clustering of long strings

Authors:
Michail Kazimianec;Nikolaus Augsten
Affiliations:
Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen, Italy;Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen, Italy
Venue:
DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications: Part II
Year:
2011

Citing 12
Cited 1

Techniques for automatically correcting words in text

ACM Computing Surveys (CSUR)
OPTICS: ordering points to identify the clustering structure

SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
Robust and efficient fuzzy match for online data cleaning

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
A Comparison of Standard Spell Checking Algorithms and a Novel Binary Neural Approach

IEEE Transactions on Knowledge and Data Engineering
Efficient set joins on similarity predicates

SIGMOD '04 Proceedings of the 2004 ACM SIGMOD international conference on Management of data
Efficient exact set-similarity joins

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
VGRAM: improving performance of approximate queries on string collections using variable-length grams

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Ed-Join: an efficient algorithm for similarity joins with edit distance constraints

Proceedings of the VLDB Endowment
Approximate Joins for Data-Centric XML

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Efficient Merging and Filtering Algorithms for Approximate String Searches

ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Clustering of Short Strings in Large Databases

DEXA '09 Proceedings of the 2009 20th International Workshop on Database and Expert Systems Application
Exact and efficient proximity graph computation

ADBIS'10 Proceedings of the 14th east European conference on Advances in databases and information systems

PG-join: proximity graph based string similarity joins

SSDBM'11 Proceedings of the 23rd international conference on Scientific and statistical database management

Quantified Score

Hi-index	0.00

Visualization

Abstract

String data is omnipresent and appears in a wide range of applications. Often string data must be partitioned into clusters of similar strings, for example, for cleansing noisy data. A promising string clustering approach is the recently proposed Graph Proximity Cleansing (GPC). A distinguishing feature of GPC is that it automatically detects the cluster borders without knowledge about the underlying data, using the so-called proximity graph. Unfortunately, the computation of the proximity graph is expensive. In particular, the runtime is high for long strings, thus limiting the application of the state-of-the-art GPC algorithm to short strings. In this work we present two algorithms, PG-Skip and PG-Binary, that efficiently compute the GPC cluster borders and scale to long strings. PG-Skip follows a prefix pruning strategy and does not need to compute the full proximity graph to detect the cluster border. PG-Skip is much faster than the state-of-the-art algorithm, especially for long strings, and computes the exact GPC borders. We show the optimality of PG-Skip among all prefix pruning algorithms. PG-Binary is an efficient approximation algorithm, which uses a binary search strategy to detect the cluster border. Our extensive experiments on synthetic and real-world data confirm the scalability of PG-Skip and show that PG-Binary approximates the GPC clusters very effectively.