Landmark-join: hash-join based string similarity joins with edit distance constraints

Authors:
Kazuyo Narita;Shinji Nakadai;Takuya Araki
Affiliations:
Cloud System Research Laboratories, NEC Corporation, Kawasaki, Kanagawa, Japan;Cloud System Research Laboratories, NEC Corporation, Kawasaki, Kanagawa, Japan;Cloud System Research Laboratories, NEC Corporation, Kawasaki, Kanagawa, Japan
Venue:
DaWaK'12 Proceedings of the 14th international conference on Data Warehousing and Knowledge Discovery
Year:
2012

Abstract

String similarity joins are harder to carry out in parallel because parallel data processing requires a well-designed data partitioning scheme. Moreover, string pairs must be verified efficiently to speed up the join as a whole. We propose a novel framework, Landmark-Join, that addresses these requirements through the use of edit distance constraints. The framework has two functions, each of which reduces a different kind of search space. The first, q-bucket partitioning, reduces the number of verifications of dissimilar string pairs and lowers skewness among buckets. The second, local upper bound calculation, prunes the search space of the edit distance computation to speed up each verification. Experimental results show that Landmark-Join has good parallel scalability and that both proposed functions speed up the entire string similarity join process.
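
As a rough illustration of the two search spaces the abstract refers to, the sketch below partitions strings into buckets so that fewer pairs need verification, and then verifies each candidate pair with a threshold-bounded edit distance that stops early. This is not the Landmark-Join algorithm: the abstract does not define q-bucket partitioning or local upper bound calculation, so the sketch substitutes plain length-based bucketing and a row-minimum cutoff. The names `within_edit_distance` and `similarity_self_join` and the threshold parameter `tau` are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def within_edit_distance(a: str, b: str, tau: int) -> bool:
    """Return True if the edit distance between a and b is at most tau."""
    # Length filter: each edit changes the length by at most one.
    if abs(len(a) - len(b)) > tau:
        return False
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cur[j] = min(prev[j] + 1,               # deletion
                         cur[j - 1] + 1,            # insertion
                         prev[j - 1] + (ca != cb))  # substitution or match
        # Row minima never decrease, so the final distance cannot drop
        # back to tau once a whole row exceeds it.
        if min(cur) > tau:
            return False
        prev = cur
    return prev[-1] <= tau


def similarity_self_join(strings, tau):
    """Return sorted index pairs (i, j), i < j, whose strings are within tau."""
    # Illustrative partitioning only: bucket by string length, so that
    # only buckets whose lengths differ by at most tau are compared.
    buckets = defaultdict(list)
    for idx, s in enumerate(strings):
        buckets[len(s)].append(idx)

    results = []
    for length, members in buckets.items():
        # Pairs inside the same bucket.
        for i, j in combinations(members, 2):
            if within_edit_distance(strings[i], strings[j], tau):
                results.append((min(i, j), max(i, j)))
        # Pairs with longer buckets that are still within tau in length.
        for other in range(length + 1, length + tau + 1):
            for i in members:
                for j in buckets.get(other, ()):
                    if within_edit_distance(strings[i], strings[j], tau):
                        results.append((min(i, j), max(i, j)))
    return sorted(results)


words = ["kitten", "sitten", "sitting", "mitten", "banana"]
pairs = similarity_self_join(words, 1)
```

In a parallel setting, each bucket (together with its neighbouring buckets) could be assigned to a different worker, which is why the balance of bucket sizes matters. Length-based buckets tend to be heavily skewed on real data, which is the kind of problem a dedicated partitioning scheme is meant to reduce.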