Incremental similarity joins with edit distance constraints

  • Authors: Dongbo Dai; Gang Zhao
  • Affiliations: Fudan University, Shanghai, China; Fudan University, Shanghai, China
  • Venue: Proceedings of the 18th ACM conference on Information and knowledge management
  • Year: 2009

Abstract

As string data grows dynamically and data from multiple sources must be integrated, performing similarity joins on dynamically augmented string sets becomes a challenging problem. Existing methods exploit only domain-oriented filters to speed up join processing on static datasets, and they are inefficient when data is generated incrementally. In this paper, an efficient approach called ISJ-ED (short for Incremental Similarity Joins with Edit Distance constraints) is proposed to tackle the similarity join problem on ever-growing string sets. We first design a distance-based filtering technique that exploits an incrementally built index to improve filtering capability. Then, for the existing filters, we study the impact of their execution order on the total filtering cost and suggest dynamically optimized filtering orders. All of these optimization strategies work jointly with the existing domain-oriented filters in ISJ-ED; that is, they are complementary to filter-based methods with edit-distance thresholds. Experimental results demonstrate that, on dynamically augmented string sets, our method is more efficient than methods that leverage only domain-oriented filters with a fixed filtering order.
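
The general setting can be illustrated with a minimal sketch. It is not the ISJ-ED algorithm itself, whose index structure and filters the abstract does not specify. The sketch assumes one possible distance-based filter: each stored string keeps its edit distance to a fixed pivot string, and the triangle inequality is used to skip pairs that cannot lie within the threshold. The names `IncrementalJoin`, `add`, `tau`, and the choice of the first inserted string as the pivot are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


class IncrementalJoin:
    """Hypothetical incremental self-join under an edit-distance threshold."""

    def __init__(self, tau: int):
        self.tau = tau
        self.pivot = None   # assumption: the first inserted string is the pivot
        self.entries = []   # (string, distance to pivot), built incrementally

    def add(self, s: str):
        """Insert s and return all pairs (t, s) with edit_distance(t, s) <= tau."""
        if self.pivot is None:
            self.pivot = s
        d_sp = edit_distance(s, self.pivot)
        matches = []
        for t, d_tp in self.entries:
            # Length filter: the distance is at least the length difference.
            if abs(len(s) - len(t)) > self.tau:
                continue
            # Distance-based filter: |d(s,p) - d(t,p)| <= d(s,t) by the
            # triangle inequality, so a larger gap rules the pair out.
            if abs(d_sp - d_tp) > self.tau:
                continue
            # Verification: compute the exact edit distance.
            if edit_distance(s, t) <= self.tau:
                matches.append((t, s))
        self.entries.append((s, d_sp))
        return matches


join = IncrementalJoin(tau=1)
join.add("kitten")
print(join.add("mitten"))   # compared only against strings added earlier
print(join.add("sitting"))
```

The two filters here run in a fixed order (length, then pivot distance). The abstract's second contribution concerns choosing that order dynamically, which this sketch does not attempt.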