Algorithms on strings, trees, and sequences: computer science and computational biology
Algorithms on strings, trees, and sequences: computer science and computational biology
Approximate String Joins in a Database (Almost) for Free
Proceedings of the 27th International Conference on Very Large Data Bases
A Primitive Operator for Similarity Joins in Data Cleaning
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Sequence Data Mining (Advances in Database Systems)
Sequence Data Mining (Advances in Database Systems)
Efficient similarity joins for near duplicate detection
Proceedings of the 17th international conference on World Wide Web
Efficient Merging and Filtering Algorithms for Approximate String Searches
ICDE '08 Proceedings of the 2008 IEEE 24th International Conference on Data Engineering
Hi-index | 0.00 |
With the dynamic increase of string data and the need to integrate data from multiple data sources, a challenging issue is to perform similarity joins on dynamically-augmented string sets. Existing methods only exploit domain-oriented filters to speed up join processing on static datasets, which are inefficient for incremental data-generation scenarios. In this paper, an efficient approach called ISJ-ED (abbr for Incremental Similarity Joins with Edit Distance constraints) is proposed to tackle similarity join problem on ever-growing string sets. We first design a distance-based filtering technique which exploits an incrementally-built index to improve the filtering capability. Then, for the existent filters, we study the impact of their executing orders on total filtering cost and suggest dynamically-optimized filtering orders. All these optimization strategies work jointly with the existing domain-oriented filters in ISJ-ED, that is, they are complementary to those filter-based methods with edit-distance thresholds. Experimental results demonstrate that on dynamically augmented string sets, our method is more efficient than those only leverage domain-oriented filters with a fixed filtering order.