Incremental similarity joins with edit distance constraints

Authors:
Dongbo Dai;Gang Zhao
Affiliations:
Fudan University, Shanghai, China;Fudan University, Shanghai, China
Venue:
Proceedings of the 18th ACM conference on Information and knowledge management
Year:
2009

Abstract

With string data growing continuously and the need to integrate data from multiple sources, performing similarity joins on dynamically augmented string sets is a challenging problem. Existing methods exploit only domain-oriented filters to speed up join processing on static datasets, and are inefficient when data is generated incrementally. In this paper, we propose an efficient approach called ISJ-ED (Incremental Similarity Joins with Edit Distance constraints) to tackle the similarity join problem on ever-growing string sets. We first design a distance-based filtering technique that exploits an incrementally built index to improve filtering capability. Then, for the existing filters, we study the impact of their execution order on total filtering cost and suggest dynamically optimized filtering orders. All these optimization strategies work jointly with the existing domain-oriented filters in ISJ-ED; that is, they are complementary to filter-based methods with edit-distance thresholds. Experimental results demonstrate that, on dynamically augmented string sets, our method is more efficient than methods that leverage only domain-oriented filters with a fixed filtering order.
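
The abstract does not spell out how the distance-based filter or its index is built, so the following is only an illustrative sketch of the general idea, not the ISJ-ED algorithm itself. It assumes a single reference string (the hypothetical `pivot`) and uses the triangle inequality of edit distance: if |ed(s, p) - ed(t, p)| exceeds the threshold, then ed(s, t) must exceed it too, so the pair can be skipped without verification. The class name `IncrementalJoin`, the `pivot` parameter, and the fixed order of the two filters are all assumptions made for illustration.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


class IncrementalJoin:
    """Hypothetical sketch of a self-join under an edit-distance threshold tau."""

    def __init__(self, tau: int, pivot: str):
        self.tau = tau
        self.pivot = pivot  # assumed reference string, not taken from the paper
        # Incrementally built index: each string with its distance to the pivot.
        self.index: List[Tuple[str, int]] = []
        self.verified = 0   # number of full edit-distance verifications

    def add(self, s: str) -> List[Tuple[str, str]]:
        """Insert s and return the new pairs (t, s) with ed(t, s) <= tau."""
        d_s = edit_distance(s, self.pivot)
        matches = []
        for t, d_t in self.index:
            # Length filter: strings whose lengths differ by more than tau
            # cannot be within edit distance tau.
            if abs(len(s) - len(t)) > self.tau:
                continue
            # Distance-based filter: ed(s, t) >= |ed(s, p) - ed(t, p)|.
            if abs(d_s - d_t) > self.tau:
                continue
            self.verified += 1
            if edit_distance(s, t) <= self.tau:
                matches.append((t, s))
        self.index.append((s, d_s))
        return matches


join = IncrementalJoin(tau=1, pivot="kitten")
for word in ["kitten", "sitten", "sitting", "mitten"]:
    print(word, join.add(word))
```

Each call to `add` compares only the newly arrived string against the strings already indexed, so earlier pairs are never recomputed. The two filters run in a fixed order here; the abstract's point about dynamically optimized filtering orders would correspond to choosing that order at run time based on observed cost and pruning power.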