Clustering web pages based on their structure

Authors:
Valter Crescenzi;Paolo Merialdo;Paolo Missier
Affiliations:
Dipartimento di Informatica e Automazione, Università Roma Tre, Via della Vasca Navale, 79, Roma 00146, Italy;Dipartimento di Informatica e Automazione, Università Roma Tre, Via della Vasca Navale, 79, Roma 00146, Italy;Department of Computer Science, University of Manchester, Oxford Road, Manchester M13 9PL, UK
Venue:
Data & Knowledge Engineering - Special issue: WIDM 2003
Year:
2005



Abstract

Several techniques have recently been proposed to automatically generate Web wrappers, i.e., programs that extract data from HTML pages and transform it into a more structured format, typically XML. These techniques automatically induce a wrapper from a set of sample pages that share a common HTML template. An open issue, however, is how to collect suitable classes of sample pages to feed the wrapper inducer; at present, the pages are chosen manually. In this paper, we tackle the problem of automatically discovering the main classes of pages offered by a site by exploring only a small yet representative portion of it. We propose a model to describe abstract structural features of HTML pages. Based on this model, we have developed an algorithm that accepts the URL of an entry point to a target Web site, visits a limited yet representative number of pages, and produces an accurate clustering of pages based on their structure. We have developed a prototype, which has been used to perform experiments on real-life Web sites.
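
The abstract does not spell out the page model or the clustering algorithm, so the following is only a minimal illustrative sketch of structure-based clustering, not the authors' method. It assumes, hypothetically, that a page can be summarized by the set of root-to-anchor tag paths of its links, and that two pages belong to the same class when the Jaccard similarity of those sets reaches a threshold. The names `link_paths`, `jaccard`, `cluster_pages`, and the threshold value are all assumptions made for illustration; crawling is omitted and pages are passed in as already-fetched HTML strings.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkPathCollector(HTMLParser):
    """Collects the root-to-anchor tag path of every <a> element."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, outermost first
        self.paths = set()  # distinct tag paths that lead to a link

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag == "a":
            self.paths.add("/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def link_paths(html):
    """Structural signature of a page (an assumed, simplified page model)."""
    collector = LinkPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.5):
    """Greedy single-pass clustering of {url: html} by structural signature.

    Each page joins the first cluster whose representative signature is
    similar enough; otherwise it starts a new cluster.
    """
    clusters = []  # list of (representative signature, [urls])
    for url, html in pages.items():
        signature = link_paths(html)
        for representative, members in clusters:
            if jaccard(signature, representative) >= threshold:
                members.append(url)
                break
        else:
            clusters.append((signature, [url]))
    return [members for _, members in clusters]
```

In this sketch, two list-style pages whose links all sit under `html/body/ul/li/a` fall into one cluster, while a page whose links sit inside a table forms a separate one. A greedy single pass was chosen only to keep the example short; its result depends on the order in which pages are visited.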