Reasoning about Textual Similarity in a Web-Based Information Access System

  • Authors:
  • William W. Cohen

  • Affiliations:
  • AT&T Labs, Research, 180 Park Avenue, Florham Park, NJ 07932; wcohen@research.att.com

  • Venue:
  • Autonomous Agents and Multi-Agent Systems
  • Year:
  • 1999

Abstract

The degree to which information sources are pre-processed by Web-based information systems varies greatly. In search engines like Altavista, little pre-processing is done, while in “knowledge integration” systems, complex site-specific “wrappers” are used to integrate different information sources into a common database representation. In this paper we describe an intermediate point between these two models. In our system, information sources are converted into a highly structured collection of small fragments of text. Database-like queries to this structured collection of text fragments are approximated using a novel logic called WHIRL, which combines inference in the style of deductive databases with ranked retrieval methods from information retrieval (IR). WHIRL allows queries that integrate information from multiple Web sites, without requiring the extraction and normalization of object identifiers that can be used as keys; instead, operations that in conventional databases require equality tests on keys are approximated using IR similarity metrics for text. This leads to a reduction in the amount of human engineering required to field a knowledge integration system. Experimental evidence is given showing that many information sources can be easily modeled with WHIRL, and that inferences in the logic are both accurate and efficient.
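The idea of replacing an equality test on keys with a text similarity score can be illustrated with a small sketch. The code below is not WHIRL and does not reproduce its logic or inference procedure; it is a plain TF-IDF cosine "similarity join" over two lists of text fragments, assuming a simple alphanumeric tokenizer and a smoothed IDF weight. All function names (`tokenize`, `tfidf_vectors`, `cosine`, `similarity_join`) and the sample titles are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one unit-length TF-IDF vector (a dict of term -> weight) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    # Document frequency: number of texts in which each term occurs.
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Assumed weighting: log-scaled term frequency times a smoothed IDF.
        vec = {t: (1 + math.log(tf)) * math.log(1 + n / df[t])
               for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def similarity_join(left, right, k=3):
    """Return the k highest-scoring (score, left_text, right_text) pairs.

    Instead of requiring left_text == right_text, every pair is scored by
    cosine similarity and the answers are ranked, best first.
    """
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    scored = [(cosine(a, b), l, r)
              for l, a in zip(left, lv)
              for r, b in zip(right, rv)]
    scored = [s for s in scored if s[0] > 0]  # drop pairs with no shared terms
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Two hypothetical sources that name the same films differently.
    listings = ["The Matrix (1999)",
                "Star Wars: Episode I - The Phantom Menace",
                "Toy Story 2"]
    reviews = ["Matrix, The", "Phantom Menace", "Toy Story II"]
    for score, a, b in similarity_join(listings, reviews):
        print(f"{score:.3f}  {a!r}  ~  {b!r}")
```

In this sketch no title in one list is string-equal to a title in the other, so an exact-match join would return nothing, while the ranked join can still surface likely pairs. A brute-force scan over all pairs is used here only for brevity; it says nothing about how the system described in the paper finds its best answers efficiently.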