Text joins in an RDBMS for web data integration

Authors:
Luis Gravano;Panagiotis G. Ipeirotis;Nick Koudas;Divesh Srivastava
Affiliations:
Columbia University;Columbia University;AT&T Labs--Research;AT&T Labs--Research
Venue:
WWW '03 Proceedings of the 12th international conference on World Wide Web
Year:
2003

Citing 19
Cited 53

The merge/purge problem for large databases

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
Integrating structured data and text: a relational approach

Journal of the American Society for Information Science
Block edit models for approximate string matching

Theoretical Computer Science - Special issue: Latin American theoretical informatics
Integration of heterogeneous databases without common domains using queries based on textual similarity

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
A parallel relational database management system approach to relevance feedback in information retrieval

Journal of the American Society for Information Science
Approximating matrix multiplication for pattern recognition tasks

SODA '97 Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms
Managing gigabytes (2nd ed.): compressing and indexing documents and images

Managing gigabytes (2nd ed.): compressing and indexing documents and images
A guided tour to approximate string matching

ACM Computing Surveys (CSUR)
Vector-space ranking with effective early termination

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Static index pruning for information retrieval systems

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Modern Information Retrieval

Introduction to Modern Information Retrieval
Computing Iceberg Queries Efficiently

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Declarative Data Cleaning: Language, Model, and Algorithms

Proceedings of the 27th International Conference on Very Large Data Bases
Approximate String Joins in a Database (Almost) for Free

Proceedings of the 27th International Conference on Very Large Data Bases
An Evaluation of Non-Equijoin Algorithms

VLDB '91 Proceedings of the 17th International Conference on Very Large Data Bases
Interactive deduplication using active learning

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning domain-independent string transformation weights for high accuracy object identification

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning to match and cluster large high-dimensional data sets for data integration

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Eliminating fuzzy duplicates in data warehouses

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases

Web data integration using approximate string join

Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters
Measuring similarity between collection of values

Proceedings of the 6th annual ACM international workshop on Web information and data management
Schema Matching Using Duplicates

ICDE '05 Proceedings of the 21st International Conference on Data Engineering
Comparative study of name disambiguation problem using a scalable blocking-based framework

Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries
SPIDER: flexible matching in databases

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Blocking-aware private record linkage

Proceedings of the 2nd international workshop on Information quality in information systems
Effective and scalable solutions for mixed and split citation problems in digital libraries

Proceedings of the 2nd international workshop on Information quality in information systems
Using SPIDER: an experience report

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Multi-column substring matching for database schema translation

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Duplicate Record Detection: A Survey

IEEE Transactions on Knowledge and Data Engineering
Data quality awareness: a case study for cost optimal association rule mining

Knowledge and Information Systems - Special Issue on Mining Low-Quality Data
Benchmarking declarative approximate selection predicates

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Adaptive sorted neighborhood methods for efficient record linkage

Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Merging the results of approximate match operations

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
FASE: A Framework for Scalable Performance Prediction of HPC Systems and Applications

Simulation
Web based linkage

Proceedings of the 9th annual ACM international workshop on Web information and data management
Parallel linkage

Proceedings of the sixteenth ACM conference on Information and knowledge management
A strategy for allowing meaningful and comparable scores in approximate matching

Proceedings of the sixteenth ACM conference on Information and knowledge management
Probabilistic correlation-based similarity measure of unstructured records

Proceedings of the sixteenth ACM conference on Information and knowledge management
Estimating the selectivity of tf-idf based cosine similarity predicates

ACM SIGMOD Record
Evaluating Performance and Quality of XML-Based Similarity Joins

ADBIS '08 Proceedings of the 12th East European conference on Advances in Databases and Information Systems
Learning to create data-integrating queries

Proceedings of the VLDB Endowment
Automatic threshold estimation for data matching applications

SBBD '08 Proceedings of the 23rd Brazilian symposium on Databases
Keyword search over relational tables and streams

ACM Transactions on Database Systems (TODS)
A strategy for allowing meaningful and comparable scores in approximate matching

Information Systems
Record linkage performance for large data sets

Proceedings of the ACM first international workshop on Privacy and anonymity for very large databases
A possibilistic approach to string comparison

IEEE Transactions on Fuzzy Systems
HARRA: fast iterative hashed record linkage for large-scale data collections

Proceedings of the 13th International Conference on Extending Database Technology
Exploiting content redundancy for web information extraction

Proceedings of the 19th international conference on World wide web
Similarity joins of text with incomplete information formats

DASFAA'07 Proceedings of the 12th international conference on Database systems for advanced applications
The fundamentals of iSPARQL: a virtual triple approach for similarity-based semantic web tasks

ISWC'07/ASWC'07 Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference
Automatically incorporating new sources in keyword search-based data integration

Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
Properties of possibilistic string comparison

IEEE Transactions on Fuzzy Systems
Efficient set-correlation operator inside databases

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Prefix tree indexing for similarity search and similarity joins on genomic data

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
Exploiting content redundancy for web information extraction

Proceedings of the VLDB Endowment
Approximate entity extraction in temporal databases

World Wide Web
Automatic threshold estimation for data matching applications

Information Sciences: an International Journal
Sharing work in keyword search over databases

Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
gStore: answering SPARQL queries via subgraph matching

Proceedings of the VLDB Endowment
Efficient top-K approximate searches against a relation with multiple attributes

World Wide Web
Integrating data from maps on the world-wide web

W2GIS'06 Proceedings of the 6th international conference on Web and Wireless Geographical Information Systems
Effective early termination techniques for text similarity join operator

ISCIS'05 Proceedings of the 20th international conference on Computer and Information Sciences
Estimating recall and precision for vague queries in databases

CAiSE'05 Proceedings of the 17th international conference on Advanced Information Systems Engineering
Mob data sourcing

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Efficient similarity search in very large string sets

SSDBM'12 Proceedings of the 24th international conference on Scientific and Statistical Database Management
De-duplication of aggregation authority files

International Journal of Metadata, Semantics and Ontologies
Actively soliciting feedback for query answers in keyword search-based data integration

Proceedings of the VLDB Endowment
Editorial: Efficient discovery of similarity constraints for matching dependencies

Data & Knowledge Engineering
Linkage of compound objects for supporting maintenance of large-scale web sites

Proceedings of the 8th International Conference on Ubiquitous Information Management and Communication

Abstract

The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Due to the lack of global identifiers, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and lack of standard formats. A fundamental task during data integration is matching of strings that refer to the same entity. In this paper, we adopt the widely used and established cosine similarity metric from the information retrieval field in order to identify potential string matches across web sources. We then use this similarity metric to characterize this key aspect of data integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose a sampling-based join approximation strategy for execution in a standard, unmodified relational database management system (RDBMS), since more and more web sites are powered by RDBMSs with a web-based front end. We implement the join inside an RDBMS, using SQL queries, for scalability and robustness reasons. Finally, we present a detailed performance evaluation of an implementation of our algorithm within a commercial RDBMS, using real-life data sets. Our experimental results demonstrate the efficiency and accuracy of our techniques.
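
The text join described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it computes the exact join rather than the sampling-based approximation, and it uses SQLite through Python's sqlite3 module in place of a commercial RDBMS. The whitespace tokenizer, the idf formula, the table names r1_weights and r2_weights, and the helper names tokenize, tfidf_rows and text_join are all assumptions made for illustration. Each string is stored as a unit-length tf-idf vector, one (tid, token, weight) row per token, so the cosine similarity of two strings is the sum of weight products over their shared tokens, which plain SQL can compute with a join, GROUP BY and HAVING.

```python
import math
import sqlite3
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (word tokens are an assumption)."""
    return text.lower().split()


def tfidf_rows(strings):
    """Return (tid, token, weight) rows holding unit-length tf-idf vectors."""
    docs = [Counter(tokenize(s)) for s in strings]
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)  # document frequency per token
    rows = []
    for tid, d in enumerate(docs):
        # One common idf variant; the exact weighting scheme is an assumption.
        raw = {t: tf * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.extend((tid, t, w / norm) for t, w in raw.items())
    return rows


def text_join(r1, r2, threshold):
    """Return (tid1, tid2, similarity) for pairs with cosine >= threshold."""
    conn = sqlite3.connect(":memory:")
    for name, strings in (("r1_weights", r1), ("r2_weights", r2)):
        conn.execute(
            f"CREATE TABLE {name} (tid INTEGER, token TEXT, weight REAL)"
        )
        conn.executemany(
            f"INSERT INTO {name} VALUES (?, ?, ?)", tfidf_rows(strings)
        )
    # Vectors are normalized, so the summed products are cosine similarities.
    query = """
        SELECT a.tid, b.tid, SUM(a.weight * b.weight) AS sim
        FROM r1_weights AS a JOIN r2_weights AS b ON a.token = b.token
        GROUP BY a.tid, b.tid
        HAVING SUM(a.weight * b.weight) >= ?
        ORDER BY a.tid, b.tid
    """
    result = conn.execute(query, (threshold,)).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    left = ["columbia university", "att labs research"]
    right = ["columbia univ", "att labs"]
    for tid1, tid2, sim in text_join(left, right, 0.6):
        print(left[tid1], "~", right[tid2], round(sim, 3))
```

Pairs that share no token never meet in the join, so they are excluded without being scored. The cost of this exact query grows with the number of token matches, which is the expense that motivates the approximation strategy the abstract proposes.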