Wrapping-oriented classification of web pages

  • Authors:
  • Valter Crescenzi; Giansalvatore Mecca; Paolo Merialdo

  • Affiliations:
  • Università Roma Tre, Via della Vasca Navale, 79, 00146 Roma, Italy; Università della Basilicata, C.da Macchia Romana, 85100 Potenza, Italy; Università Roma Tre, Via della Vasca Navale, 79, 00146 Roma, Italy

  • Venue:
  • Proceedings of the 2002 ACM symposium on Applied computing
  • Year:
  • 2002

Abstract

Data extraction from HTML Web pages is performed by software programs called wrappers. Writing wrappers is a costly and labor-intensive task; recently, several proposals have attacked the problem of generating wrappers automatically. In this paper, we study a problem related to the automation of the wrapper generation process: given a portion of a Web site to wrap, we develop techniques to cluster its HTML pages into page classes with homogeneous organization and layout; these classes can then serve as input to the wrapper generation process. In addition, once a wrapper library has been generated for a set of Web sites, our techniques can be used to select, for any new page downloaded from these sites, the right wrapper in the library. Based on the proposed techniques, we have developed a software prototype and conducted several experiments on HTML pages from real-life Web sites.
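
To make the two tasks in the abstract concrete (grouping pages into layout-homogeneous classes, then routing a new page to the matching class), here is a minimal Python sketch. It is not the technique proposed in the paper, whose details the abstract does not give. As a stand-in it assumes that layout can be approximated by the set of root-to-element tag paths in a page, compares pages with Jaccard similarity, and clusters them greedily. The names `signature`, `cluster_pages`, `select_class`, and the `threshold` value are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of tag paths (e.g. 'html/body/ul/li') seen in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unclosed elements: pop back to the most recent matching tag.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def signature(html):
    """Assumed layout signature: the set of tag paths in the page."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.6):
    """Greedy one-pass clustering of {name: html} into page classes.

    Each page joins the first class whose representative signature is at
    least `threshold` similar; otherwise it starts a new class.
    """
    classes = []  # list of (representative_signature, [page names])
    for name, html in pages.items():
        sig = signature(html)
        for representative, members in classes:
            if jaccard(sig, representative) >= threshold:
                members.append(name)
                break
        else:
            classes.append((sig, [name]))
    return classes


def select_class(html, classes):
    """Return (index, similarity) of the class closest to a new page."""
    sig = signature(html)
    scores = [jaccard(sig, representative) for representative, _ in classes]
    best = max(range(len(classes)), key=scores.__getitem__)
    return best, scores[best]
```

In a setting like the one the abstract describes, each class index would map to one wrapper in the library, so `select_class` plays the role of wrapper selection for a newly downloaded page. A real system would likely need a richer layout model and a more robust clustering strategy than this one-pass greedy scheme.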