Combining content extraction heuristics: the CombinE system

  • Authors: Thomas Gottron

  • Affiliations: Johannes Gutenberg-Universität Mainz, Mainz, Germany

  • Venue: Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services
  • Year: 2008

Abstract

The main text content of an HTML document on the WWW is typically surrounded by additional content, such as navigation menus, advertisements, link lists, or design elements. Content Extraction (CE) is the task of identifying and extracting the main content. Ongoing research has produced several CE heuristics of varying quality. So far, however, only the Crunch framework combines several heuristics to improve its overall CE performance, and many new algorithms have been formulated since Crunch. The CombinE system is designed to test, evaluate, and optimise combinations of CE heuristics. Its aim is to develop CE systems that yield better and more reliable extracts of the main content of a web document.
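
To make the idea of combining CE heuristics concrete, the following is a minimal sketch of one possible combination strategy, a simple vote over text blocks. It is not taken from the CombinE system or from Crunch: the three toy heuristics, their thresholds, and the names `long_block`, `low_link_density`, `has_sentence_punctuation`, and `combine_by_vote` are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence

# Assumption: a heuristic takes the text blocks of a page and returns
# one keep/drop decision per block. This interface is illustrative only.
Heuristic = Callable[[Sequence[str]], List[bool]]


def long_block(blocks: Sequence[str], min_words: int = 10) -> List[bool]:
    """Toy heuristic: keep blocks with at least `min_words` words."""
    return [len(b.split()) >= min_words for b in blocks]


def low_link_density(blocks: Sequence[str], max_ratio: float = 0.3) -> List[bool]:
    """Toy heuristic: keep blocks in which few tokens look like links.

    Tokens starting with "http" stand in for real anchor detection.
    """
    decisions = []
    for b in blocks:
        tokens = b.split()
        links = sum(t.startswith("http") for t in tokens)
        decisions.append(bool(tokens) and links / len(tokens) <= max_ratio)
    return decisions


def has_sentence_punctuation(blocks: Sequence[str]) -> List[bool]:
    """Toy heuristic: keep blocks containing sentence punctuation."""
    return [any(ch in b for ch in ".!?") for b in blocks]


def combine_by_vote(
    blocks: Sequence[str],
    heuristics: Sequence[Heuristic],
    min_votes: int,
) -> List[str]:
    """Keep every block that at least `min_votes` heuristics voted to keep."""
    votes = [0] * len(blocks)
    for heuristic in heuristics:
        for i, keep in enumerate(heuristic(blocks)):
            votes[i] += int(keep)
    return [b for b, v in zip(blocks, votes) if v >= min_votes]
```

In this sketch, raising `min_votes` makes the combined extractor stricter (fewer blocks survive), while lowering it makes the extractor more inclusive. Evaluating such settings against a reference extract is the kind of comparison a system for testing heuristic combinations would need to support.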