Fine-grain web site structure discovery

Authors:
Valter Crescenzi;Paolo Merialdo;Paolo Missier
Affiliations:
Università Roma Tre, Roma, Italy;Università Roma Tre, Roma, Italy;Università Roma Tre, Roma, Italy
Venue:
WIDM '03 Proceedings of the 5th ACM international workshop on Web information and data management
Year:
2003


Abstract

Several techniques have recently been proposed to automatically derive web wrappers, i.e., programs that extract data from HTML pages and transform them into a more structured format, typically XML. These techniques induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. At present, the pages are chosen manually.

In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small, representative portion of it. The web site model we propose describes the structure of the site as a graph whose nodes are classes of pages that share a common structure, and whose edges represent links among instances of the page classes. Using this model, we have developed an algorithm that accepts the URL of an entry point to the target web site, visits a limited portion of the site, and produces an accurate model of the site structure. We also report on preliminary experiments performed on actual web sites, which have produced encouraging results.
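
As an illustration only, the site model described in the abstract (nodes are page classes, edges are links among instances of those classes) might be represented as in the minimal Python sketch below. It is not the authors' algorithm: it assumes the assignment of pages to classes is already known. The function name `build_site_model`, the class labels, and the URL paths are hypothetical.

```python
from collections import defaultdict


def build_site_model(page_class, links):
    """Build a class-level graph from page-level data.

    page_class: dict mapping a page URL to its (assumed, precomputed) class label.
    links: iterable of (source_url, target_url) pairs observed while crawling.
    Returns (nodes, edges): nodes maps a class label to the set of its pages;
    edges maps a (source_class, target_class) pair to the number of page-level
    links that connect instances of the two classes.
    """
    nodes = defaultdict(set)
    for url, cls in page_class.items():
        nodes[cls].add(url)

    edges = defaultdict(int)
    for src, dst in links:
        # Ignore links whose endpoints were not visited or classified.
        if src in page_class and dst in page_class:
            edges[(page_class[src], page_class[dst])] += 1

    return dict(nodes), dict(edges)


# Hypothetical example: a small sports site with three page classes.
example_pages = {
    "/": "Home",
    "/teams/1": "Team",
    "/teams/2": "Team",
    "/players/7": "Player",
}
example_links = [
    ("/", "/teams/1"),
    ("/", "/teams/2"),
    ("/teams/1", "/players/7"),
    ("/teams/1", "/external"),  # target never classified, so it is skipped
]
example_nodes, example_edges = build_site_model(example_pages, example_links)
```

In this sketch the two team pages collapse into a single `Team` node, so the model stays small even when a class has many instances. Discovering the classes while visiting only a limited portion of the site is the problem the paper addresses, and is not attempted here.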