The integration of data produced and collected across autonomous, heterogeneous web services is an increasingly important and challenging problem. Because global identifiers are lacking, the same entity (e.g., a product) might have different textual representations across databases. Textual data is also often noisy because of transcription errors, incomplete information, and a lack of standard formats. A fundamental task during data integration is matching strings that refer to the same entity. In this paper, we adopt the widely used and established cosine similarity metric from the information retrieval field to identify potential string matches across web sources. We then use this similarity metric to characterize this key aspect of data integration as a join between relations on textual attributes, where the similarity of matches exceeds a specified threshold. Computing an exact answer to the text join can be expensive. For query processing efficiency, we propose a sampling-based join approximation strategy for execution in a standard, unmodified relational database management system (RDBMS), since more and more web sites are powered by RDBMSs with a web-based front end. We implement the join inside an RDBMS, using SQL queries, for scalability and robustness. Finally, we present a detailed performance evaluation of an implementation of our algorithm within a commercial RDBMS, using real-life data sets. Our experimental results demonstrate the efficiency and accuracy of our techniques.
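To make the notion of a text join concrete, the following is a minimal sketch of the exact (non-approximate) join that the abstract describes as expensive: pairs of strings whose tf-idf cosine similarity meets a threshold, computed with a plain SQL query. It is not the sampling-based strategy proposed in the paper. Several details are assumptions made only for illustration: tokens are lowercase whitespace-separated words, the idf weight is log(1 + N/df) computed separately per relation, SQLite stands in for the commercial RDBMS, and the names `tfidf_rows`, `text_join`, `r_weights`, and `s_weights` are hypothetical.

```python
import math
import sqlite3
from collections import Counter


def tokens(text):
    # Assumption: lowercase whitespace-separated words serve as tokens.
    return text.lower().split()


def tfidf_rows(strings):
    """Return (tid, token, weight) rows, one unit-length tf-idf vector per string."""
    docs = [Counter(tokens(s)) for s in strings]
    n = len(docs)
    # Document frequency: number of strings in this relation containing the token.
    df = Counter(tok for doc in docs for tok in doc)
    rows = []
    for tid, doc in enumerate(docs):
        # Assumed idf variant log(1 + N/df); it never evaluates to zero.
        raw = {tok: tf * math.log(1 + n / df[tok]) for tok, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.extend((tid, tok, w / norm) for tok, w in raw.items())
    return rows


def text_join(r_strings, s_strings, threshold):
    """Exact text join: all pairs with cosine similarity >= threshold."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE r_weights (tid INTEGER, token TEXT, weight REAL)")
    con.execute("CREATE TABLE s_weights (tid INTEGER, token TEXT, weight REAL)")
    con.executemany("INSERT INTO r_weights VALUES (?, ?, ?)", tfidf_rows(r_strings))
    con.executemany("INSERT INTO s_weights VALUES (?, ?, ?)", tfidf_rows(s_strings))
    # With unit-length vectors, cosine similarity is the sum of weight
    # products over the tokens the two strings share.
    result = con.execute(
        """
        SELECT r.tid, s.tid, SUM(r.weight * s.weight) AS sim
        FROM r_weights AS r JOIN s_weights AS s ON r.token = s.token
        GROUP BY r.tid, s.tid
        HAVING SUM(r.weight * s.weight) >= ?
        ORDER BY r.tid, s.tid
        """,
        (threshold,),
    ).fetchall()
    con.close()
    return [(r_strings[i], s_strings[j], sim) for i, j, sim in result]


if __name__ == "__main__":
    r = ["Acme Widget Company", "Globex Corporation"]
    s = ["Acme Widget Co", "Initech", "Globex Corporation"]
    for left, right, sim in text_join(r, s, 0.5):
        print(f"{left!r} ~ {right!r}: {sim:.3f}")
```

The query touches every pair of strings that share at least one token, which hints at why an exact answer can be costly on large relations and why an approximation is attractive.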