Bridging the gap: from multi document Template Detection to single document Content Extraction

Authors:
Thomas Gottron
Affiliations:
Johannes Gutenberg-Universität Mainz, Mainz, Germany
Venue:
EuroIMSA '08 Proceedings of the IASTED International Conference on Internet and Multimedia Systems and Applications
Year:
2008

Citing 12
Cited 4

Syntactic clustering of the Web

Selected papers from the sixth international conference on World Wide Web
Template detection via data mining and its applications

Proceedings of the 11th international conference on World Wide Web
QuASM: a system for question answering using semi-structured data

Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries
Discovering informative content blocks from Web documents

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
DOM-based content extraction of HTML documents

WWW '03 Proceedings of the 12th international conference on World Wide Web
Eliminating noisy information in Web pages for data mining

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
On the complexity of schema inference from web pages in the presence of nullable data attributes

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Automatic web news extraction using tree edit distance

Proceedings of the 13th international conference on World Wide Web
The volume and evolution of web page templates

WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Automatic extraction of informative blocks from webpages

Proceedings of the 2005 ACM symposium on Applied computing
Page-level template detection via isotonic smoothing

Proceedings of the 16th international conference on World Wide Web
Identifying content blocks from web documents

ISMIS'05 Proceedings of the 15th international conference on Foundations of Intelligent Systems

Combining content extraction heuristics: the CombinE system

Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services
TitleFinder: extracting the headline of news web pages based on cosine similarity and overlap scoring similarity

Proceedings of the twelfth international workshop on Web information and data management
Cluster-based page segmentation-a fast and precise method for web page pre-processing

Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics
Locality sensitive hashing for scalable structural classification and clustering of web documents

Proceedings of the 22nd ACM international conference on Conference on information & knowledge management

Quantified Score

Hi-index	0.00

Visualization

Abstract

Template Detection algorithms use collections of web documents to determine the structure of a common underlying template. Content Extraction algorithms instead operate on a single document and use heuristics to determine the main content. In this paper we propose a way to combine the reliability and theoretic underpinning of the first world with the single document based approach of the latter. Starting from a single initial document we use the set of hyperlinked web pages to build the required training set for Template Detection automatically. By clustering the documents in this set according to their underlying templates we clean the training set from documents based on different templates. We confirm the applicability of the approach by using an entropy based Template Detection algorithm to build a Content Extractor.