Performance evaluation of similarity join for real time information integration

Authors:
Manish Kumar;Shane Moriah;Srikumar Krishnamoorthy
Affiliations:
Purdue University, West Lafayette, IN;Stanford University, Palo Alto, CA;Infosys Technologies Ltd, Electronics City, Bangalore, India
Venue:
Proceedings of the 2nd Bangalore Annual Compute Conference
Year:
2009

Abstract

Approximate join processing plays a key role in many application areas such as data cleansing, data integration, text mining, and bioinformatics. Much research has focused on approximate join processing based on the edit distance metric. Approximate join processing algorithms generally use a variety of q-gram-based filtering techniques to improve scalability. The primary approach taken in the literature relies on methods available inside a particular database language. However, this is impractical when heterogeneous data is spread across multiple databases. A popular alternative approach directly compares every possible pairing of two strings. However, such algorithms do not scale well to very large databases, even after applying q-gram filters. Here we design a novel, stand-alone filtering technique, essentially a modification of the HashJoin algorithm, to improve the scalability of similarity join processing algorithms. We implement the algorithm and conduct a number of experiments to study its performance. The presented algorithm is also integrated with a real-life data federation solution called Infosys Gradient. The paper presents the performance results on a real-life test bed.
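The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it refers to: a hash-based index over q-grams used to prune candidate pairs before the exact edit distance check. It is not the authors' method. The function names (`qgrams`, `edit_distance`, `similarity_join`), the padding characters `#` and `$`, and the default values of `k` and `q` are all assumptions made for illustration. The pruning uses two standard filters from the q-gram literature: the length filter and the count filter for padded q-grams.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    """Bag of q-grams of s, padded with q-1 sentinel characters (assumed: '#', '$')."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity_join(left, right, k=1, q=2):
    """Return (s, t, d) for all pairs with edit distance d <= k (hypothetical helper)."""
    # Build phase: hash each q-gram of the right input to the rows containing it.
    index = defaultdict(list)
    for j, t in enumerate(right):
        for gram, count in qgrams(t, q).items():
            index[gram].append((j, count))

    results = []
    for s in left:
        # Probe phase: count q-grams shared with each right row (bag intersection).
        shared = Counter()
        for gram, count in qgrams(s, q).items():
            for j, other_count in index.get(gram, ()):
                shared[j] += min(count, other_count)

        # If the count bound can be <= 0, a match may share no q-gram at all,
        # so every right row has to be considered for this s.
        if len(s) - 1 - (k - 1) * q > 0:
            candidates = list(shared)
        else:
            candidates = range(len(right))

        for j in candidates:
            t = right[j]
            if abs(len(s) - len(t)) > k:
                continue  # length filter
            if shared[j] < max(len(s), len(t)) - 1 - (k - 1) * q:
                continue  # count filter
            d = edit_distance(s, t)  # verify surviving candidates exactly
            if d <= k:
                results.append((s, t, d))
    return results
```

In this sketch only pairs that survive both filters reach the quadratic-time edit distance computation, which is where the savings over comparing every pairing would come from. Because the index is built in application code rather than in SQL, the same routine could in principle be applied to rows fetched from different databases.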