Web data integration using approximate string join

Authors:
Yingping Huang;Gregory Madey
Affiliations:
University of Notre Dame, Notre Dame, IN;University of Notre Dame, Notre Dame, IN
Venue:
Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters
Year:
2004

Citing 5
Cited 3

Incremental distance join algorithms for spatial databases
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data

Embeddings and non-approximability of geometric problems
SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms

Approximate String Joins in a Database (Almost) for Free
Proceedings of the 27th International Conference on Very Large Data Bases

Text joins in an RDBMS for web data integration
WWW '03 Proceedings of the 12th international conference on World Wide Web

Efficient Record Linkage in Large Data Sets
DASFAA '03 Proceedings of the Eighth International Conference on Database Systems for Advanced Applications

SlideSeer: a digital library of aligned document and presentation pairs
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries

Estimating the selectivity of tf-idf based cosine similarity predicates
ACM SIGMOD Record

Abstract

Web data integration is an important preprocessing step for web mining. Several records on the web whose textual representations differ are highly likely to represent the same real-world entity. These records are called approximate duplicates. Data integration seeks to identify such approximate duplicates and merge them into integrated records. Many existing data integration algorithms make use of approximate string join, which seeks to (approximately) find all pairs of strings whose distances are less than a certain threshold. In this paper, we propose a new mapping method to detect pairs of strings with similarity above a certain threshold. In our method, each string is first mapped to a point in a high-dimensional grid space; pairs of points at distance 1 from each other are then identified. We implement the method using Oracle SQL and PL/SQL. Finally, we evaluate it on real data sets. Experimental results suggest that our method is both accurate and efficient.
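To make the join operation described in the abstract concrete, the following is a minimal Python sketch of a naive approximate string join. It is not the paper's grid-mapping method, whose details the abstract does not give, and it is not the Oracle SQL and PL/SQL implementation. It assumes Levenshtein edit distance as the string distance, and the function names `edit_distance` and `approximate_string_join` are hypothetical, chosen only for illustration.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program.

    Assumption: edit distance stands in for the unspecified string distance.
    """
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def approximate_string_join(left, right, threshold):
    """Return all (l, r) pairs whose edit distance is below `threshold`.

    Naive baseline: compares every pair, so cost grows with
    len(left) * len(right). Hypothetical helper, not the paper's algorithm.
    """
    return [(l, r) for l, r in product(left, right)
            if edit_distance(l, r) < threshold]
```

Calling `approximate_string_join(names_a, names_b, 3)` on two lists of record names would return every cross-list pair that differs by fewer than three single-character edits. The quadratic number of comparisons in this baseline is the kind of cost that mapping strings to points and then searching for nearby points is presumably intended to reduce.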