Integration of heterogeneous databases without common domains using queries based on textual similarity

Authors:
William W. Cohen
Affiliations:
AT&T Labs-Research, 180 Park Avenue, Florham Park NJ
Venue:
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Year:
1998





Abstract

Most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIRL is much faster than naive inference methods, even for short queries. We also show that inferences made by WHIRL are surprisingly accurate, equaling the accuracy of hand-coded normalization routines on one benchmark problem, and outperforming exact matching with a plausible global domain on a second.
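
The idea of joining name constants by vector-space similarity rather than exact equality can be illustrated with a minimal sketch. This is not WHIRL itself: the abstract gives no implementation details, so the tokenizer, the smoothed TF-IDF weighting, and every function name below (tokenize, build_idf, tfidf_vector, cosine, similarity_join) are assumptions made for illustration. The sketch also scores every pair of names, which corresponds to the kind of naive method the abstract says WHIRL improves on.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(documents):
    """Smoothed inverse document frequency per term (an assumed weighting)."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(text, idf):
    """Unit-length TF-IDF vector stored as a dict; unknown terms are skipped."""
    tf = Counter(tokenize(text))
    weights = {t: (1 + math.log(c)) * idf[t] for t, c in tf.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_join(left, right, k=3):
    """Naive all-pairs join: return the k highest-scoring (score, l, r) triples."""
    idf = build_idf(left + right)
    left_vecs = [tfidf_vector(x, idf) for x in left]
    right_vecs = [tfidf_vector(y, idf) for y in right]
    scored = [
        (cosine(a, b), x, y)
        for x, a in zip(left, left_vecs)
        for y, b in zip(right, right_vecs)
    ]
    # Keep only pairs that share at least one term, best matches first.
    scored = [s for s in scored if s[0] > 0.0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Hypothetical name constants from two sources with no shared key domain.
    source_a = ["Stanford University", "Bell Laboratories"]
    source_b = ["Stanford Univ.", "Lucent Bell Labs", "MIT"]
    for score, a, b in similarity_join(source_a, source_b):
        print(f"{score:.3f}  {a!r} ~ {b!r}")
```

Because rare terms receive higher weights, two names that share an uncommon token score higher than two that share only a common one, which is what lets a ranked similarity join stand in for a hand-built global domain in this sketch.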