Reasoning about Textual Similarity in a Web-Based Information Access System

Authors:
William W. Cohen
Affiliations:
AT&T Labs, Research, 180 Park Avenue, Florham Park, NJ 07932 wcohen@research.att.com
Venue:
Autonomous Agents and Multi-Agent Systems
Year:
1999

Abstract

The degree to which information sources are pre-processed by Web-based information systems varies greatly. In search engines like AltaVista, little pre-processing is done, while in “knowledge integration” systems, complex site-specific “wrappers” are used to integrate different information sources into a common database representation. In this paper we describe an intermediate point between these two models. In our system, information sources are converted into a highly structured collection of small fragments of text. Database-like queries to this structured collection of text fragments are approximated using a novel logic called WHIRL, which combines inference in the style of deductive databases with ranked retrieval methods from information retrieval (IR). WHIRL allows queries that integrate information from multiple Web sites without requiring the extraction and normalization of object identifiers that can be used as keys; instead, operations that in conventional databases require equality tests on keys are approximated using IR similarity metrics for text. This reduces the amount of human engineering required to field a knowledge integration system. Experimental evidence is given showing that many information sources can be easily modeled with WHIRL, and that inferences in the logic are both accurate and efficient.
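
To make the idea of replacing key equality with textual similarity concrete, here is a minimal sketch of a "similarity join" between two collections of text fragments. It is not WHIRL itself: the abstract does not name the similarity metric, so the sketch assumes TF-IDF weighted cosine similarity (a standard IR choice), uses a smoothed IDF formula chosen for illustration, and scores every pair by brute force. The function names and the movie data are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per document.

    Assumed weighting: (1 + log tf) * log((N + 1) / df), which keeps every
    weight strictly positive. This formula is an illustrative choice.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: (1 + math.log(c)) * math.log((n + 1) / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_join(left, right, k=3):
    """Score every (left, right) pair and return the k best as (score, l, r)."""
    vecs = tfidf_vectors(left + right)
    left_vecs, right_vecs = vecs[:len(left)], vecs[len(left):]
    scored = [
        (cosine(a, b), l, r)
        for l, a in zip(left, left_vecs)
        for r, b in zip(right, right_vecs)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical data: the same films named differently on two Web sites.
listings = ["Star Wars: Episode IV", "The Lion King", "Blade Runner"]
reviews = [
    "Review of Blade Runner (director's cut)",
    "Lion King, The (1994)",
    "Star Wars Episode 4 - A New Hope",
]

for score, listing, review in similarity_join(listings, reviews):
    print(f"{score:.3f}  {listing!r} ~ {review!r}")
```

No shared key exists between the two lists, yet the highest-ranked pairs are the ones a person would match by name. The result is a ranked list of candidate matches rather than an exact set, which mirrors the ranked-retrieval flavor the abstract describes.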