Template Detection algorithms use collections of web documents to determine the structure of a common underlying template. Content Extraction algorithms instead operate on a single document and use heuristics to identify the main content. In this paper we propose a way to combine the reliability and theoretical underpinning of the former with the single-document approach of the latter. Starting from a single initial document, we use the set of hyperlinked web pages to automatically build the training set required for Template Detection. By clustering the documents in this set according to their underlying templates, we remove documents based on different templates from the training set. We confirm the applicability of the approach by using an entropy-based Template Detection algorithm to build a Content Extractor.
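The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions made only for illustration: each document is a dict holding a set of tag paths (`paths`) and a list of text blocks (`blocks`); the hyperlinked pages are taken to be those linked from the initial document; a simple Jaccard threshold on tag paths stands in for the template clustering step; and a normalized term entropy over the training set stands in for the entropy-based Template Detection algorithm. All function names and thresholds are hypothetical.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_template_docs(seed, linked, threshold=0.5):
    """Assumed stand-in for template clustering: keep linked documents
    whose tag-path set is similar enough to the seed's."""
    return [d for d in linked if jaccard(seed["paths"], d["paths"]) >= threshold]


def term_entropies(docs):
    """Normalized entropy of each term's occurrence distribution over docs.
    Terms spread evenly over all documents score near 1 (template-like);
    terms confined to one document score 0 (content-like)."""
    n = len(docs)
    counts = [
        Counter(t for block in d["blocks"] for t in block.lower().split())
        for d in docs
    ]
    totals = Counter()
    for c in counts:
        totals.update(c)
    entropies = {}
    for term, total in totals.items():
        h = 0.0
        for c in counts:
            if c[term]:
                p = c[term] / total
                h -= p * math.log(p)
        entropies[term] = h / math.log(n) if n > 1 else 0.0
    return entropies


def extract_content(seed, linked, sim_threshold=0.5, entropy_threshold=0.5):
    """Return the seed's blocks whose mean term entropy is below the threshold."""
    training = [seed] + same_template_docs(seed, linked, sim_threshold)
    ent = term_entropies(training)
    content = []
    for block in seed["blocks"]:
        terms = block.lower().split()
        if not terms:
            continue
        score = sum(ent[t] for t in terms) / len(terms)
        if score < entropy_threshold:
            content.append(block)
    return content
```

Calling `extract_content(seed, linked)` returns the blocks of the initial document that the training set does not share. With only the initial document available (no linked page passing the similarity filter), every term has entropy 0 in this sketch, so all blocks are kept; the approach depends on the filtered training set being non-trivial.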