WRAPPER INFERENCE FOR AMBIGUOUS WEB PAGES

  • Authors: Valter Crescenzi; Paolo Merialdo
  • Affiliation (both authors): Dipartimento di Informatica e Automazione, Università degli Studi Roma Tre, Roma, Italy
  • Venue: Applied Artificial Intelligence
  • Year: 2008

Abstract

Several studies have concentrated on the generation of wrappers for web data sources. Because wrappers can be easily described as grammars, the grammatical inference heritage could play a significant role in this research field. Recent results have identified a new subclass of regular languages, called prefix mark-up languages, that nicely abstracts the structures usually found in the HTML pages of large web sites. This class has been proven to be identifiable in the limit, and a PTIME unsupervised learning algorithm has previously been developed for it. Unfortunately, many real-life web pages do not fall into this class of languages. In this article we analyze the roots of the problem and propose a technique that transforms pages in order to bring them into the class of prefix mark-up languages. In this way, we obtain a practical solution without giving up the formal background defined within the grammatical inference framework. We report on experiments conducted on real-life web pages to evaluate the approach; the results demonstrate the effectiveness of the presented techniques.