Classification of HTML Documents by Hidden Tree-Markov Models

  • Authors: F. Scarselli
  • Venue: ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition
  • Year: 2001

Abstract

Content-based search and organization of Web documents pose new issues in Information Retrieval. We propose a novel approach to classifying HTML documents based on a structured representation of their contents, which are split into logical contexts (paragraphs, sections, anchors, etc.). Classification is performed using Hidden Tree-Markov Models (HTMMs), an extension of Hidden Markov Models for processing structured objects. We report promising experimental results showing that the structured representation improves classification accuracy in most cases.
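
The structured representation mentioned in the abstract can be pictured as a tree whose nodes are logical contexts, each holding the words that occur directly inside it. The following is only a minimal sketch of that idea, not the authors' implementation: the set of tags in `CONTEXT_TAGS`, the `Node` class, and the `build_context_tree` helper are all hypothetical names chosen for illustration, and the abstract does not specify which tags the paper actually uses.

```python
from html.parser import HTMLParser

# Assumption: an illustrative set of tags treated as logical contexts.
# The abstract only names paragraphs, sections, and anchors as examples.
CONTEXT_TAGS = {"html", "body", "title", "h1", "h2", "h3", "p", "a", "li"}


class Node:
    """One logical context: its tag, its own words, and its sub-contexts."""

    def __init__(self, tag):
        self.tag = tag
        self.words = []
        self.children = []


class ContextTreeBuilder(HTMLParser):
    """Builds a tree of Node objects while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]  # currently open contexts, innermost last

    def handle_starttag(self, tag, attrs):
        # Open a new context only for tags in the chosen set.
        if tag in CONTEXT_TAGS:
            node = Node(tag)
            self.stack[-1].children.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open context with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Attach words to the innermost open context.
        self.stack[-1].words.extend(data.lower().split())


def build_context_tree(html):
    builder = ContextTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def show(node, depth=0):
    """Print the tree, one context per line, indented by depth."""
    print("  " * depth + node.tag, node.words)
    for child in node.children:
        show(child, depth + 1)


if __name__ == "__main__":
    page = "<html><body><h1>Neural nets</h1><p>See <a href='x'>this page</a> now</p></body></html>"
    show(build_context_tree(page))
```

A tree of this kind is the sort of structured object a tree-based model such as an HTMM would take as input, whereas a flat bag-of-words classifier would discard the nesting of contexts.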