Classification of HTML Documents by Hidden Tree-Markov Models

Authors:
F. Scarselli
Affiliations:
-
Venue:
ICDAR '01 Proceedings of the Sixth International Conference on Document Analysis and Recognition
Year:
2001

Cited by (7):

Mining HTML Pages to Support Document Sharing in a Cooperative System
EDBT '02 Proceedings of the Workshops XMLDM, MDDE, and YRWS on XML-Based Data Management and Multimedia Engineering-Revised Papers

Structured multimedia document classification
Proceedings of the 2003 ACM symposium on Document engineering

Bayesian network model for semi-structured document classification
Information Processing and Management: an International Journal - Special issue: Bayesian networks and information retrieval

Hierarchical topic segmentation of websites
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining

Two-phase Web site classification based on Hidden Markov Tree models
Web Intelligence and Agent Systems

A belief networks-based generative model for structured documents: an application to the XML categorization
MLDM'03 Proceedings of the 3rd international conference on Machine learning and data mining in pattern recognition

Discovering missing values in semi-structured databases
Large Scale Semantic Access to Content (Text, Image, Video, and Sound)

Abstract

Content-based search and organization of Web documents pose new issues in Information Retrieval. We propose a novel approach to the classification of HTML documents based on a structured representation of their contents, which are split into logical contexts (paragraphs, sections, anchors, etc.). The classification is performed using Hidden Tree-Markov Models (HTMMs), an extension of Hidden Markov Models for processing structured objects. We report promising experimental results showing that the structured representation improves classification accuracy in most cases.
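
To make the idea of a structured representation concrete, the following is a minimal sketch of how an HTML page might be split into a tree of logical contexts before a tree-structured model such as an HTMM is applied. The abstract does not describe the actual preprocessing, so the tag-to-context mapping (`CONTEXT_TAGS`), the `Node` class, and the `build_context_tree` helper are all assumptions introduced here for illustration. The sketch covers only the representation step, not the HTMM itself.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to logical context labels.
# This is an assumption for illustration, not the mapping used in the paper.
CONTEXT_TAGS = {
    "title": "title",
    "h1": "section",
    "h2": "section",
    "p": "paragraph",
    "a": "anchor",
    "li": "list_item",
}


class Node:
    """One logical context: a label, its own words, and nested contexts."""

    def __init__(self, label):
        self.label = label
        self.words = []
        self.children = []


class ContextTreeBuilder(HTMLParser):
    """Builds a tree of logical contexts from the tags listed in CONTEXT_TAGS."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        # Stack of (tag, node) pairs; the root has no tag.
        self.stack = [(None, self.root)]

    def handle_starttag(self, tag, attrs):
        if tag in CONTEXT_TAGS:
            node = Node(CONTEXT_TAGS[tag])
            self.stack[-1][1].children.append(node)
            self.stack.append((tag, node))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy nesting.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Attach the words to the innermost open context.
        self.stack[-1][1].words.extend(data.lower().split())


def build_context_tree(html):
    builder = ContextTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = (
    "<html><head><title>Tree Models</title></head>"
    "<body><h1>Intro</h1>"
    "<p>Hidden <a href='x'>Markov link</a> models</p></body></html>"
)
tree = build_context_tree(page)
```

In this sketch each node keeps only the words that appear directly in its own context, so the anchor text is stored in the anchor node rather than in its enclosing paragraph. A tree-structured classifier could then treat each node's label and words as the observation at that node.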