Exact and efficient proximity graph computation

Authors:
Michail Kazimianec;Nikolaus Augsten
Affiliations:
Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen, Italy;Faculty of Computer Science, Free University of Bozen-Bolzano, Bozen, Italy
Venue:
ADBIS'10 Proceedings of the 14th East European Conference on Advances in Databases and Information Systems
Year:
2010


Abstract

Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been successfully used for string cleansing. For each potential cluster, a so-called proximity graph is computed, and the cluster border is detected based on this graph. Unfortunately, computing the proximity graph is expensive, and the state-of-the-art GPC algorithms only approximate it using a sampling technique. In this paper, we propose two efficient algorithms for the exact computation of proximity graphs. The first algorithm, PG-DS, is based on a divide-skip technique for merging inverted lists; the second, PG-SM, uses a sort-merge join strategy. We experimentally evaluate our solution on large real-world datasets and show that our algorithms are faster than the sampling-based approximation algorithms, even for very small sample sizes.
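
To make the notion of "merging inverted lists" concrete, the following is a minimal sketch of the basic problem that such merging solves: given an inverted index from q-grams to string ids, find all strings that share at least t q-grams with a center string. It is not the paper's PG-DS or PG-SM algorithm and does not implement divide-skip; it is a plain counting merge. The function names (`qgrams`, `build_index`, `neighbors`), the `#` padding, and the use of distinct q-grams are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Return the q-grams of s, padded with '#' so border characters are covered."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def build_index(strings, q=2):
    """Map each distinct q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index[g].append(sid)
    return index


def neighbors(center, index, t, q=2):
    """Ids of strings sharing at least t distinct q-grams with center.

    Illustrative counting merge: every inverted list of the center's
    q-grams is scanned in full and occurrences are tallied per string id.
    """
    counts = Counter()
    for g in set(qgrams(center, q)):
        counts.update(index.get(g, ()))
    return sorted(sid for sid, c in counts.items() if c >= t)


# Hypothetical toy data, not taken from the paper.
strings = ["bolzano", "bolzana", "bozen", "vienna"]
index = build_index(strings)
close = neighbors("bolzano", index, t=6)
loose = neighbors("bolzano", index, t=2)
```

Lowering t enlarges the neighborhood: here `close` contains only the two near-identical spellings, while `loose` also picks up "bozen". A counting merge like this scans every inverted list in full, which is the kind of cost that skipping-based list merging aims to reduce.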