An automatic data grabber for large web sites

Authors:
Valter Crescenzi;Giansalvatore Mecca;Paolo Merialdo;Paolo Missier
Affiliations:
Università Roma Tre, Roma, Italy;Università della Basilicata, Potenza, Italy;Università Roma Tre, Roma, Italy;Università Roma Tre, Roma, Italy
Venue:
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Year:
2004

Abstract

We demonstrate a system that automatically grabs data from data-intensive web sites. The system first infers a model that describes the web site, at the intensional level, as a collection of classes; each class represents a set of structurally homogeneous pages and is associated with a small set of representative pages. Based on the model, a library of wrappers, one per class, is then inferred with the help of an external wrapper generator. The model, together with the library of wrappers, can then be used to navigate the site and extract the data.
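
The pipeline described in the abstract (page classes with representative pages, one wrapper per class, extraction by class) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: every name here (PageClass, SiteModel, build_wrapper_library, toy_wrapper_generator) is hypothetical, the structural fingerprint (the set of tag names on a page, compared by Jaccard overlap) is an assumed stand-in for whatever notion of structural homogeneity the system actually uses, and the wrapper generator is a toy placeholder for the external one.

```python
import re
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Callable, Dict, FrozenSet, List, Tuple

# A wrapper maps a page's HTML to extracted fields (hypothetical type).
Wrapper = Callable[[str], Dict[str, str]]


class _TagCollector(HTMLParser):
    """Collects the names of all start tags seen in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def signature(html: str) -> FrozenSet[str]:
    # ASSUMPTION: the set of tag names is used as a crude structural fingerprint.
    collector = _TagCollector()
    collector.feed(html)
    return frozenset(collector.tags)


@dataclass
class PageClass:
    name: str
    samples: List[str]  # small set of representative pages

    def similarity(self, html: str) -> float:
        # Best Jaccard overlap between the page and any representative page.
        sig = signature(html)
        best = 0.0
        for sample in self.samples:
            other = signature(sample)
            union = sig | other
            best = max(best, len(sig & other) / len(union) if union else 1.0)
        return best


@dataclass
class SiteModel:
    classes: List[PageClass]

    def classify(self, html: str) -> PageClass:
        return max(self.classes, key=lambda c: c.similarity(html))


def build_wrapper_library(model: SiteModel,
                          generate: Callable[[List[str]], Wrapper]) -> Dict[str, Wrapper]:
    # One wrapper per class, inferred from that class's representative pages.
    return {c.name: generate(c.samples) for c in model.classes}


def extract(model: SiteModel, library: Dict[str, Wrapper],
            html: str) -> Tuple[str, Dict[str, str]]:
    page_class = model.classify(html)
    return page_class.name, library[page_class.name](html)


LEAF = re.compile(r"<(\w+)>([^<]+)</\1>")


def toy_wrapper_generator(samples: List[str]) -> Wrapper:
    # ASSUMPTION: toy stand-in for the external wrapper generator. It keeps
    # the text of leaf elements whose tag also occurs as a leaf in the samples.
    fields = {tag for s in samples for tag, _ in LEAF.findall(s)}

    def wrapper(html: str) -> Dict[str, str]:
        return {tag: text.strip() for tag, text in LEAF.findall(html) if tag in fields}

    return wrapper


# Made-up sample pages for a hypothetical site with two page classes.
product_sample = "<html><body><h1>Lamp</h1><table><tr><td>19.99</td></tr></table></body></html>"
listing_sample = "<html><body><ul><li>Lamp</li><li>Chair</li></ul></body></html>"

model = SiteModel([PageClass("product", [product_sample]),
                   PageClass("listing", [listing_sample])])
library = build_wrapper_library(model, toy_wrapper_generator)

new_page = "<html><body><h1>Chair</h1><table><tr><td>45.00</td></tr></table></body></html>"
class_name, record = extract(model, library, new_page)
```

Passing the generator in as a callable mirrors the abstract's point that wrapper generation is delegated to an external component, so the site model and the wrapper library stay independent of how any single wrapper is inferred.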