Abstract: Content-based search and organization of Web documents pose new issues in Information Retrieval. We propose a novel approach to classifying HTML documents based on a structured representation of their contents, which are split into logical contexts (paragraphs, sections, anchors, etc.). Classification is performed using Hidden Tree-Markov Models (HTMMs), an extension of Hidden Markov Models for processing structured objects. We report promising experimental results showing that the structured representation improves classification accuracy in most cases.
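As an illustration of the structured representation mentioned in the abstract, the following is a minimal sketch of how an HTML page might be split into a tree of logical contexts, each carrying the words that appear directly inside it. It is not the authors' implementation: the set of context tags, the `ContextNode` class, and the `build_context_tree` function are assumptions introduced here, and the HTMM classifier itself is not shown.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Hypothetical choice of tags treated as logical contexts. The abstract
# names only paragraphs, sections, and anchors as examples.
CONTEXT_TAGS = {"html", "head", "title", "body", "div",
                "h1", "h2", "h3", "p", "a", "ul", "ol", "li"}


@dataclass
class ContextNode:
    """One logical context: its tag, its own words, and nested contexts."""
    tag: str
    words: list = field(default_factory=list)
    children: list = field(default_factory=list)


class ContextTreeBuilder(HTMLParser):
    """Builds a tree of ContextNode objects while parsing HTML."""

    def __init__(self):
        super().__init__()
        self.root = ContextNode("document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Open a new context only for tags in the assumed context set.
        if tag in CONTEXT_TAGS:
            node = ContextNode(tag)
            self._stack[-1].children.append(node)
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open context, tolerating unclosed tags.
        # The root (index 0) is never popped.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        # Attach lowercased words to the innermost open context.
        self._stack[-1].words.extend(data.lower().split())


def build_context_tree(html):
    """Parse an HTML string and return the root of its context tree."""
    builder = ContextTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

In this sketch, text inside tags that are not in the context set (for example `<b>`) is attributed to the nearest enclosing context. A tree-structured model such as an HTMM could then take the per-node word lists as observations, although how the original work maps words to observations is not specified in the abstract.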