Constraint-based wrapper specification and verification for cooperative information systems

  • Authors:
  • Thomas Y. Lee; Yingwei Yang

  • Affiliations:
  • Department of Operations and Information Management, University of Pennsylvania, 3730 Walnut Street, Philadelphia, PA; Department of Operations and Information Management, University of Pennsylvania, 3730 Walnut Street, Philadelphia, PA

  • Venue:
  • Information Systems - Special issue: Data quality in cooperative information systems
  • Year:
  • 2004

Abstract

In this paper, we propose the use of semistructured constraints in wrappers to mitigate the impact of poor extraction accuracy on Cooperative Information System (CIS) data quality. Wrappers are a critical element of CISs whenever the constituent information systems publish semistructured text such as forms, reports, and memos rather than structured databases. The accuracy of CIS data that stem from text depends on the accuracy of the wrappers as well as the accuracy of the underlying sources. Wrapper specification is the process of defining patterns (i.e., regular expressions) to extract information from semistructured text. Wrapper verification is the process of ensuring extraction accuracy, that is, ensuring that the extracted information faithfully reflects the underlying source. We focus on the problem of extraction accuracy. We use constraints on semistructured data for both wrapper specification and verification; consequently, we perform extraction and verification simultaneously. We apply the concept to wrappers for a Uniform Domain Name Dispute Resolution Policy (UDRP) CIS of arbitration decisions. UDRP decisions are currently distributed across arbitration authorities on three continents. The accuracy of data extracted using constraint-based specification and verification is measured by Type I and Type II errors.
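
The idea of specifying and verifying a wrapper with the same constraints can be illustrated with a minimal sketch. It is not the authors' system: the field names, the regular expressions, the constraint checks, the sample text, and the helper names extract_and_verify and error_counts are all hypothetical, chosen only for illustration. The reading of Type I errors as wrongly extracted values and Type II errors as missed values is likewise an assumption; the abstract does not define them.

```python
import re
from datetime import datetime

# Hypothetical extraction patterns for a decision-like text (not from the paper).
FIELD_PATTERNS = {
    "case_no": re.compile(r"Case No\.\s*(\S+)"),
    "domain": re.compile(r"Domain Name:\s*(\S+)"),
    "decision_date": re.compile(r"Dated:\s*(\d{4}-\d{2}-\d{2})"),
}


def _is_real_date(value):
    """Constraint: the string must be a valid calendar date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


# Hypothetical constraints, one per field, checked at extraction time.
CONSTRAINTS = {
    "case_no": lambda v: re.fullmatch(r"D\d{4}-\d{4}", v) is not None,
    "domain": lambda v: re.fullmatch(r"[a-z0-9-]+(\.[a-z0-9-]+)+", v.lower()) is not None,
    "decision_date": _is_real_date,
}


def extract_and_verify(text):
    """Extract each field and check its constraint in the same pass.

    Returns (record, violations): record holds only values that satisfied
    their constraint; violations lists fields that were missing or rejected.
    """
    record, violations = {}, []
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            violations.append(f"{field}: no match")
            continue
        value = match.group(1)
        if not CONSTRAINTS[field](value):
            violations.append(f"{field}: constraint failed for {value!r}")
            continue
        record[field] = value
    return record, violations


def error_counts(extracted, truth):
    """Count errors against hand-labelled truth (assumed definitions).

    Type I: a field was extracted but its value differs from the truth.
    Type II: a field present in the truth was not extracted at all.
    """
    type_1 = sum(1 for f, v in extracted.items() if truth.get(f) != v)
    type_2 = sum(1 for f in truth if f not in extracted)
    return type_1, type_2


# Made-up sample text; the date is deliberately impossible (30 February).
SAMPLE = "Case No. D9999-0001\nDomain Name: example.com\nDated: 2000-02-30\n"

if __name__ == "__main__":
    record, violations = extract_and_verify(SAMPLE)
    print(record)
    print(violations)
```

In this sketch the impossible date matches the extraction pattern but fails its constraint, so it is flagged during extraction rather than passed silently into the integrated data. A production wrapper would presumably need richer constraints than per-field checks, for example constraints relating several fields to one another.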