Approximate join processing plays a key role in application areas such as data cleansing, data integration, text mining, and bioinformatics. Much of the research on approximate joins is based on the edit distance metric, and the resulting algorithms generally use a variety of q-gram based filtering techniques to improve scalability. The primary approach in the literature relies on facilities of a particular database language, which is impractical when heterogeneous data is spread across multiple databases. A popular alternative directly compares every pairing of strings drawn from the two inputs, but such algorithms do not scale well to very large databases, even after q-gram filters are applied. Here we design a novel, stand-alone filtering technique, essentially a modification of the HashJoin algorithm, to improve the scalability of similarity join processing. We implement the algorithm and conduct a number of experiments to study its performance. The algorithm is also integrated with a real-life data federation solution called Infosys Gradient, and the paper presents performance results on a real-life test bed.
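To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of a hash-based similarity join with q-gram filtering. It is not the paper's algorithm, whose details the abstract does not give. It assumes the textbook formulation: a build phase that hashes every q-gram of one input to the rows containing it, a probe phase that collects candidates sharing at least one q-gram, the standard length and count filters, and a final edit distance check. The names `qgrams`, `edit_distance`, and `similarity_join`, and the defaults `k` and `q`, are illustrative assumptions.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    """Return the multiset of overlapping q-grams of s (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def similarity_join(left, right, k, q=2):
    """Return all pairs (l, r) with edit_distance(l, r) <= k.

    Illustrative sketch only: a hash join on q-grams followed by
    length filter, count filter, and exact verification.
    """
    # Strings this short may share no q-gram with a true match,
    # so the hash index alone cannot be trusted to find them.
    short_limit = q - 1 + k * q

    # Build phase: hash each q-gram of the right input to its row ids.
    index = defaultdict(list)
    right_grams = []
    short_rows = []
    for rid, r in enumerate(right):
        grams = qgrams(r, q)
        right_grams.append(grams)
        for g in grams:
            index[g].append(rid)
        if len(r) <= short_limit:
            short_rows.append(rid)

    # Probe phase: look up each q-gram of the left string.
    results = []
    for l in left:
        left_grams = qgrams(l, q)
        candidates = set()
        for g in left_grams:
            candidates.update(index.get(g, ()))
        if len(l) <= short_limit:
            candidates.update(short_rows)

        for rid in sorted(candidates):
            r = right[rid]
            # Length filter: k edits change the length by at most k.
            if abs(len(l) - len(r)) > k:
                continue
            # Count filter: each edit destroys at most q q-grams.
            needed = max(len(l), len(r)) - q + 1 - k * q
            shared = sum((left_grams & right_grams[rid]).values())
            if shared < needed:
                continue
            # Verification: only surviving pairs pay for edit distance.
            if edit_distance(l, r) <= k:
                results.append((l, r))
    return results
```

For example, `similarity_join(["database", "join"], ["databse", "joins", "hash"], k=1)` pairs `"database"` with `"databse"` and `"join"` with `"joins"`, while `"hash"` is rejected before any edit distance is computed for it. Keeping the filter outside any database engine, as here, is what allows the two inputs to come from different systems.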