Using Grammatical Inference to Automate Information Extraction from the Web

  • Authors:
  • Theodore W. Hong; Keith L. Clark

  • Venue:
  • PKDD '01 Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery
  • Year:
  • 2001

Abstract

The World-Wide Web contains a wealth of semistructured information sources that often give partial or overlapping views of the same domains, such as real estate listings or book prices. These partial sources could be used more effectively if integrated into a single view; however, because they are typically formatted in diverse ways for human viewing, extracting their data for integration is difficult. Existing learning systems for this task generally use hardcoded ad hoc heuristics, are restricted in the domains and structures they can recognize, and/or require manual training. We describe a principled method, based on grammatical inference, for automatically generating extraction wrappers; the method can recognize general structures and does not rely on manually labelled examples. Domain-specific knowledge is explicitly separated out in the form of declarative rules. The method is demonstrated in a test setting by extracting real estate listings from web pages and integrating them into an interactive data visualization tool based on dynamic queries.
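
The abstract does not say which inference algorithm the authors use, so the following is only an illustrative sketch of one common starting point in grammatical inference: reducing each record to a sequence of structural tokens and building a prefix tree acceptor (PTA) from those sequences. The names `TagTokenizer`, `tokenize`, `build_pta`, and `accepts`, and the sample listings, are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser


class TagTokenizer(HTMLParser):
    """Reduce an HTML fragment to a sequence of structural tokens (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse every non-blank text run into one abstract TEXT token.
        if data.strip():
            self.tokens.append("TEXT")


def tokenize(html):
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


def build_pta(samples):
    """Build a prefix tree acceptor from token sequences.

    State 0 is the root; `transitions` maps (state, token) -> next state.
    """
    transitions = {}
    accepting = set()
    next_state = 1
    for tokens in samples:
        state = 0
        for tok in tokens:
            key = (state, tok)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    return transitions, accepting


def accepts(pta, tokens):
    """Return True if the token sequence ends in an accepting state of the PTA."""
    transitions, accepting = pta
    state = 0
    for tok in tokens:
        state = transitions.get((state, tok))
        if state is None:
            return False
    return state in accepting


# Made-up listings, used only to exercise the sketch.
listings = [
    "<li><b>3 bed</b> $250,000</li>",
    "<li><b>2 bed</b> $180,000</li>",
    "<li>Studio $90,000</li>",
]
pta = build_pta([tokenize(item) for item in listings])
```

A PTA accepts exactly the token sequences it was built from, so an unseen listing is recognized only when its tag structure matches a training sample. Inference algorithms typically generalize further by merging states, a step this sketch omits.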