An automatic data grabber for large web sites

  • Authors: Valter Crescenzi; Giansalvatore Mecca; Paolo Merialdo; Paolo Missier

  • Affiliations: Università Roma Tre, Roma, Italy; Università della Basilicata, Potenza, Italy; Università Roma Tre, Roma, Italy; Università Roma Tre, Roma, Italy

  • Venue: VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
  • Year: 2004

Abstract

We demonstrate a system to automatically grab data from data-intensive web sites. The system first infers a model that describes the web site at the intensional level as a collection of classes; each class represents a set of structurally homogeneous pages and is associated with a small set of representative pages. Based on the model, a library of wrappers, one per class, is then inferred with the help of an external wrapper generator. The model, together with the library of wrappers, can thus be used to navigate the site and extract the data.
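
The pipeline described in the abstract (page classes with representative pages, one wrapper per class, then classification and extraction) can be illustrated with a minimal Python sketch. Nothing here is taken from the demonstrated system: the names `PageClass`, `SiteModel`, `tag_signature`, and the two wrappers are hypothetical, the tag-set Jaccard similarity is an assumed stand-in for whatever structural comparison the system uses, and the hand-written regex wrappers stand in for the output of the external wrapper generator.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List, Optional

TAG_RE = re.compile(r"<([a-zA-Z][a-zA-Z0-9]*)")  # opening tags only


def tag_signature(html: str) -> FrozenSet[str]:
    """Crude structural fingerprint (assumption): the set of tag names in a page."""
    return frozenset(t.lower() for t in TAG_RE.findall(html))


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity of two tag sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class PageClass:
    """A class of structurally homogeneous pages with its samples and wrapper."""
    name: str
    samples: List[str]                       # small set of representative pages
    wrapper: Callable[[str], Dict[str, str]]  # hypothetical per-class extractor

    def similarity(self, html: str) -> float:
        sig = tag_signature(html)
        return max(jaccard(sig, tag_signature(s)) for s in self.samples)


@dataclass
class SiteModel:
    """The site seen as a collection of page classes."""
    classes: List[PageClass] = field(default_factory=list)

    def classify(self, html: str, threshold: float = 0.5) -> Optional[PageClass]:
        best = max(self.classes, key=lambda c: c.similarity(html), default=None)
        if best is None or best.similarity(html) < threshold:
            return None  # page does not fit any known class
        return best

    def extract(self, html: str) -> Optional[Dict[str, str]]:
        cls = self.classify(html)
        if cls is None:
            return None
        return {"class": cls.name, **cls.wrapper(html)}


# Hand-written wrappers standing in for generated ones (illustrative only).
def product_wrapper(html: str) -> Dict[str, str]:
    m = re.search(r"<h1>(.*?)</h1>.*?<span>(.*?)</span>", html, re.S)
    return {"title": m.group(1), "price": m.group(2)} if m else {}


def list_wrapper(html: str) -> Dict[str, str]:
    return {"links": ",".join(re.findall(r'<a href="(.*?)"', html))}


model = SiteModel([
    PageClass("product",
              ["<html><body><h1>Book A</h1><span>10 EUR</span></body></html>"],
              product_wrapper),
    PageClass("list",
              ['<html><body><ul><li><a href="/p/1">A</a></li></ul></body></html>'],
              list_wrapper),
])

page = "<html><body><h1>Book B</h1><span>12 EUR</span></body></html>"
print(model.extract(page))
```

In this sketch a new page is assigned to the class whose representative pages it most resembles, and only then is that class's wrapper applied; a page below the similarity threshold is left unclassified rather than forced through an ill-fitting wrapper.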