Performance evaluation of similarity join for real time information integration

  • Authors:
  • Manish Kumar; Shane Moriah; Srikumar Krishnamoorthy

  • Affiliations:
  • Purdue University, West Lafayette, IN; Stanford University, Palo Alto, CA; Infosys Technologies Ltd, Electronics City, Bangalore, India

  • Venue:
  • Proceedings of the 2nd Bangalore Annual Compute Conference
  • Year:
  • 2009

Abstract

Approximate join processing plays a key role in many application areas such as data cleansing, data integration, text mining, and bioinformatics. There has been much research interest in approximate join processing based on the edit distance metric. Approximate join algorithms generally use a variety of q-gram based filtering techniques to improve scalability. The primary approach taken in the literature exploits methods available inside a particular database language. However, this is impractical when heterogeneous data is spread across multiple databases. A popular alternative is to compare every pair of strings from the two inputs directly. However, such algorithms do not scale well to very large databases, even after applying q-gram filters. Here we design a novel, stand-alone filtering technique, essentially a modification of the HashJoin algorithm, to improve the scalability of similarity join processing algorithms. We implement the algorithm and conduct a number of experiments to study its performance. The presented algorithm is also integrated with a real-life data federation solution called Infosys Gradient. The paper presents the performance results on a real-life test bed.
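The abstract does not give the details of the proposed filter, so the following is only a minimal sketch of the general idea it describes: a hash-join-style build and probe over q-grams, followed by edit distance verification. It uses the standard q-gram count and length filters rather than the authors' technique. The function names, the `#` and `$` padding characters (assumed not to occur in the data), and the defaults `k=1`, `q=2` are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Return the q-grams of s after padding with q-1 sentinels on each side."""
    # Assumption: '#' and '$' never occur in the input strings.
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def similarity_join(left, right, k=1, q=2):
    """Return sorted (left_index, right_index) pairs with edit distance <= k."""
    # Two strings no longer than `limit` can be within distance k without
    # sharing any q-gram, so such pairs must bypass the index.
    limit = k * q - q + 1

    # Build phase: hash every q-gram of the right input to (string id, count).
    index = defaultdict(list)
    short = []
    for rid, r in enumerate(right):
        for gram, count in Counter(qgrams(r, q)).items():
            index[gram].append((rid, count))
        if len(r) <= limit:
            short.append(rid)

    results = []
    for lid, s in enumerate(left):
        # Probe phase: count q-grams shared with each right string.
        shared = Counter()
        for gram, count in Counter(qgrams(s, q)).items():
            for rid, rcount in index.get(gram, ()):
                shared[rid] += min(count, rcount)

        # Count filter: strings within distance k share at least
        # max(|s|, |r|) + q - 1 - k*q q-grams.
        candidates = {
            rid for rid, n in shared.items()
            if n >= max(len(s), len(right[rid])) + q - 1 - k * q
        }
        if len(s) <= limit:
            candidates.update(short)

        # Length filter, then exact verification of the survivors.
        for rid in sorted(candidates):
            r = right[rid]
            if abs(len(s) - len(r)) <= k and edit_distance(s, r) <= k:
                results.append((lid, rid))
    return results
```

For example, `similarity_join(["Stanford"], ["Stanfort", "Bangalore"])` verifies only the first right-hand string, since "Bangalore" shares too few q-grams with "Stanford" to pass the count filter. Because the index is built in plain application code, a sketch like this does not depend on any one database engine, which is the setting the abstract targets.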