Highly efficient algorithms for structural clustering of large websites

Authors:
Lorenzo Blanco; Nilesh Dalvi; Ashwin Machanavajjhala
Affiliations:
Università degli Studi Roma Tre, Rome, Italy; Yahoo! Research, Santa Clara, CA, USA; Yahoo! Research, Santa Clara, CA, USA
Venue:
Proceedings of the 20th international conference on World wide web
Year:
2011

Abstract

In this paper, we present a highly scalable algorithm for structurally clustering webpages for extraction. We show that webpages can be clustered effectively and efficiently using only their URLs and simple content features. At the heart of our techniques is a principled framework, grounded in information theory, that lets us leverage the URLs and combine them with content and structural properties. In an extensive evaluation over several large, complete websites, we demonstrate the effectiveness of our techniques at a scale unattainable by previous approaches.
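
To make the idea of clustering pages from their URLs alone more concrete, the following is a minimal illustrative sketch and not the algorithm from the paper. It assumes a hypothetical heuristic: path tokens containing a digit are treated as data values and replaced by a wildcard, and pages whose generalized URLs match are placed in the same cluster. The function names `url_signature` and `cluster_by_signature` and the example URLs are invented for illustration; the information-theoretic framework and the content features mentioned in the abstract are not modeled here.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_signature(url):
    """Return a coarse structural signature for a URL.

    Assumption (not from the paper): a path token that contains a digit
    is treated as a data value and generalized to the wildcard "*".
    """
    parts = urlsplit(url)
    tokens = [t for t in parts.path.split("/") if t]
    generalized = tuple(
        "*" if any(c.isdigit() for c in t) else t for t in tokens
    )
    return (parts.netloc, generalized)


def cluster_by_signature(urls):
    """Group URLs that share the same signature, preserving input order."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_signature(url)].append(url)
    return dict(clusters)


if __name__ == "__main__":
    sample = [
        "http://example.com/movie/123",
        "http://example.com/movie/456",
        "http://example.com/actor/7",
        "http://example.com/about",
    ]
    for signature, members in cluster_by_signature(sample).items():
        print(signature, len(members))
```

Because each URL is processed once and no page content is fetched, a pass like this is linear in the number of URLs, which hints at why URL-based features are attractive at the scale of a full website. A fixed rule such as the digit heuristic is brittle, though, which is one reason a principled criterion for choosing among candidate clusterings is needed.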