Several techniques have recently been proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages and transform them into a more structured format, typically XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer: at present, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.
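
To make the idea of clustering pages by structure more concrete, the following is a minimal Python sketch. It is not the model or algorithm of the paper, whose details the abstract does not give. It assumes, purely for illustration, that the structural features of a page can be approximated by the set of tag paths leading from the document root to its links, and that two pages belong to the same class when the Jaccard similarity of these sets reaches a threshold. The names `LinkPathParser`, `link_paths`, `jaccard`, and `cluster_pages`, and the threshold value, are hypothetical. Crawling from an entry-point URL is omitted; the pages are supplied as already-downloaded HTML strings.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkPathParser(HTMLParser):
    """Collects the root-to-element tag path of every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # distinct tag paths that lead to a link

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def link_paths(html):
    """Assumed structural feature of a page: the set of tag paths to its links."""
    parser = LinkPathParser()
    parser.feed(html)
    parser.close()
    return frozenset(parser.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.5):
    """Greedy one-pass clustering of a {url: html} mapping.

    Each page joins the first cluster whose representative (the feature set
    of the cluster's first page) is similar enough; otherwise it starts a
    new cluster. The threshold is an illustrative assumption.
    """
    clusters = []  # list of (representative feature set, member urls)
    for url, html in pages.items():
        features = link_paths(html)
        for representative, members in clusters:
            if jaccard(representative, features) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((features, [url]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    sample = {
        "list-1": "<html><body><ul><li><a href='a'>A</a></li></ul></body></html>",
        "list-2": "<html><body><ul><li><a href='b'>B</a></li>"
                  "<li><a href='c'>C</a></li></ul></body></html>",
        "table-1": "<html><body><table><tr><td><a href='d'>D</a></td></tr>"
                   "</table></body></html>",
    }
    print(cluster_pages(sample))
```

In this sketch the two list pages share the single link path `html/body/ul/li/a` and end up in one cluster, while the table page forms its own. A greedy one-pass scheme is order-dependent and is used here only because it is short; any other clustering method could be substituted over the same feature sets.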