Bridging the gap: from multi document Template Detection to single document Content Extraction

  • Authors:
  • Thomas Gottron

  • Affiliations:
  • Johannes Gutenberg-Universität Mainz, Mainz, Germany

  • Venue:
  • EuroIMSA '08 Proceedings of the IASTED International Conference on Internet and Multimedia Systems and Applications
  • Year:
  • 2008

Abstract

Template Detection algorithms use collections of web documents to determine the structure of a common underlying template. Content Extraction algorithms instead operate on a single document and use heuristics to determine the main content. In this paper we propose a way to combine the reliability and theoretical underpinning of the former with the single-document approach of the latter. Starting from a single initial document, we use the set of hyperlinked web pages to automatically build the training set required for Template Detection. By clustering the documents in this set according to their underlying templates, we remove from the training set those documents that are based on different templates. We confirm the applicability of the approach by using an entropy-based Template Detection algorithm to build a Content Extractor.
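
The entropy-based step can be illustrated with a minimal sketch. It is not the algorithm from the paper, only a simplified stand-in under stated assumptions: each document is assumed to be already segmented into text blocks, the training set is assumed to be already collected and cleaned (crawling and template clustering are omitted), and the names `term_entropies`, `extract_content` and the `threshold` value are hypothetical. A term that is spread evenly over all training documents gets a normalized entropy near 1 and is treated as template material; a term concentrated in one document gets an entropy near 0.

```python
import math
from collections import Counter


def term_entropies(docs):
    """Normalized entropy of each term's distribution over the documents.

    docs: list of documents, each a list of text blocks (strings).
    Returns a dict mapping term -> entropy in [0, 1] (up to rounding).
    """
    n = len(docs)
    # Term frequencies per document.
    tf = [Counter(w for block in doc for w in block.lower().split())
          for doc in docs]
    totals = Counter()
    for counts in tf:
        totals.update(counts)

    entropies = {}
    for term, total in totals.items():
        h = 0.0
        for counts in tf:
            if counts[term]:
                p = counts[term] / total
                h -= p * math.log(p)
        # Normalize by the maximum possible entropy, log(n).
        entropies[term] = h / math.log(n) if n > 1 else 0.0
    return entropies


def extract_content(doc, entropies, threshold=0.8):
    """Keep blocks whose mean term entropy is below the (assumed) threshold."""
    kept = []
    for block in doc:
        words = block.lower().split()
        if not words:
            continue
        score = sum(entropies.get(w, 0.0) for w in words) / len(words)
        if score < threshold:
            kept.append(block)
    return kept


# Toy training set: one shared navigation block, one unique block per page.
docs = [
    ["Home News Contact", "Cats sleep most of the day"],
    ["Home News Contact", "Stock markets fell sharply"],
    ["Home News Contact", "Rain expected tomorrow"],
]
entropies = term_entropies(docs)
main_content = extract_content(docs[0], entropies)
```

In this toy example the navigation block is shared by all three pages and is discarded, while the block unique to the first page is kept. On real pages the threshold and the block segmentation would need tuning.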