Extracting relational data from HTML repositories

  • Authors: Ruth Yuee Zhang; Laks V. S. Lakshmanan; Ruben H. Zamar
  • Affiliations: Univ. of British Columbia, Vancouver, BC, Canada (all authors)
  • Venue: ACM SIGKDD Explorations Newsletter
  • Year: 2004


Abstract

There is a vast amount of valuable information in HTML documents, widely distributed across the World Wide Web and across corporate intranets. Unfortunately, HTML is mainly presentation-oriented and hard to query. In this paper, we develop a system to extract desired information (records) from thousands of HTML documents, starting from a small set of examples. Duplicates in the result are automatically detected and eliminated. We propose a novel method to estimate the current coverage of results by the system, based on capture-recapture models with unequal capture probabilities. We also propose techniques for estimating the error rate of the extracted information and an interactive technique for enhancing information quality. To evaluate the methods and ideas proposed in this paper, we conducted an extensive set of experiments. Our experimental results validate the effectiveness and utility of our system, and demonstrate interesting tradeoffs between the running time of information extraction and the coverage of results.
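
To make the idea of coverage estimation concrete, here is a minimal sketch of the capture-recapture principle. It is not the paper's method: the abstract describes models with unequal capture probabilities, whereas this sketch swaps in the much simpler Chapman-corrected Lincoln-Petersen estimator, which assumes every record is equally likely to be captured. The function names and the idea of treating two independent extraction runs as two "capture occasions" are assumptions made for illustration only.

```python
def chapman_estimate(first_run, second_run):
    """Estimate the total number of distinct records that exist.

    Uses the Chapman-corrected Lincoln-Petersen estimator, which assumes
    equal capture probabilities (a simplification; the paper's models
    allow unequal probabilities).
    """
    first = set(first_run)    # converting to sets also removes duplicates
    second = set(second_run)
    recaptured = len(first & second)  # records seen in both runs
    return (len(first) + 1) * (len(second) + 1) / (recaptured + 1) - 1


def estimated_coverage(first_run, second_run):
    """Fraction of the estimated total that the two runs have found so far."""
    found = len(set(first_run) | set(second_run))
    if found == 0:
        return 0.0  # nothing extracted yet, so no coverage to report
    total = chapman_estimate(first_run, second_run)
    return min(1.0, found / total)


if __name__ == "__main__":
    # Hypothetical record identifiers, invented purely for this example.
    run_a = ["r1", "r2", "r3", "r4", "r5", "r6"]
    run_b = ["r4", "r5", "r6", "r7", "r8"]
    print(chapman_estimate(run_a, run_b))
    print(estimated_coverage(run_a, run_b))
```

The intuition is that a large overlap between two runs suggests most records have already been found, while a small overlap suggests many remain unseen. The equal-probability assumption is unlikely to hold for Web data, where some records appear on many pages and others on few, which is presumably why the authors turn to models with unequal capture probabilities.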