Focused crawling: a new approach to topic-specific Web resource discovery
WWW '99 Proceedings of the eighth international conference on World Wide Web
XTRACT: a system for extracting document type descriptors from XML documents
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Wrapping-oriented classification of web pages
Proceedings of the 2002 ACM symposium on Applied computing
Machine Learning
Using Grammatical Inference to Automate Information Extraction from the Web
PKDD '01 Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery
RoadRunner: Towards Automatic Data Extraction from Large Web Sites
Proceedings of the 27th International Conference on Very Large Data Bases
Data-rich Section Extraction from HTML pages
WISE '02 Proceedings of the 3rd International Conference on Web Information Systems Engineering
Data extraction and label assignment for web databases
WWW '03 Proceedings of the 12th international conference on World Wide Web
Extracting structured data from Web pages
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
The Minimum Description Length Principle (Adaptive Computation and Machine Learning)
The Minimum Description Length Principle (Adaptive Computation and Machine Learning)
Clustering web pages based on their structure
Data & Knowledge Engineering - Special issue: WIDM 2003
AutoFeed: an unsupervised learning system for generating webfeeds
Proceedings of the 3rd international conference on Knowledge capture
Overview of autofeed: an unsupervised learning system for generating webfeeds
AAAI'06 proceedings of the 21st national conference on Artificial intelligence - Volume 2
A site oriented method for segmenting web pages
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Hi-index | 0.00 |
Several techniques have been recently proposed to automatically derive web wrappers, i.e., programs that extract data from HTML pages, and transform them into a more structured format, typically in XML syntax. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer. Presently, the pages are chosen manually.In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small, representative, portion of it. The web site model we propose describes the structure of the site as a graph whose nodes are classes of pages that share a common structure, and whose edges represent links among instances of the page classes. Using this model, we have developed an algorithm that accepts the url of an entry point to the target web site, visits a limited portion of the site, and produces an accurate model of the site structure. We also report on preliminary experiments performed on actual web sites, that have produced encouraging results.