In this paper, we present a highly scalable algorithm for structurally clustering webpages for extraction. We show that, using only the URLs of the webpages and simple content features, it is possible to cluster webpages effectively and efficiently. At the heart of our techniques is a principled, information-theoretic framework that lets us leverage the URLs effectively and combine them with content and structural properties. Through an extensive evaluation over several large full websites, we demonstrate the effectiveness of our techniques at a scale unattainable by earlier approaches.
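
To make the idea of URL-only structural clustering concrete, here is a minimal illustrative sketch. It is not the algorithm of the paper: it swaps in a simple per-position Shannon entropy heuristic as a stand-in for the information-theoretic framework mentioned above. The function names (`url_tokens`, `entropy`, `cluster_urls`), the `max_entropy` threshold, and the rule that URLs with different token counts never share a cluster are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL into its host followed by its non-empty path segments."""
    parsed = urlparse(url)
    return [parsed.netloc] + [seg for seg in parsed.path.split("/") if seg]


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def cluster_urls(urls, max_entropy=1.5):
    """Group URLs by a pattern that wildcards high-entropy token positions.

    Hypothetical heuristic: positions whose tokens vary a lot are assumed to
    carry data (ids, names), while stable positions are assumed to reflect
    the page template.
    """
    # Assumption: URLs with different token counts never share a cluster.
    by_length = defaultdict(list)
    for url in urls:
        toks = url_tokens(url)
        by_length[len(toks)].append((url, toks))

    clusters = defaultdict(list)
    for length, items in by_length.items():
        # Keep only the positions whose token distribution is nearly constant.
        fixed = {
            i for i in range(length)
            if entropy([toks[i] for _, toks in items]) <= max_entropy
        }
        for url, toks in items:
            pattern = tuple(toks[i] if i in fixed else "*" for i in range(length))
            clusters[pattern].append(url)
    return dict(clusters)
```

On a small site whose product and review pages live under `/product/<id>` and `/review/<id>`, this sketch would keep the `product` and `review` segments as part of the pattern and wildcard the ids, so pages generated by the same script end up together. A real system would also need the content and structural features that the abstract mentions, which this sketch leaves out.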