A Case-Based Recognition of Semantic Structures in HTML Documents

Authors:
Masayuki Umehara;Koji Iwanuma;Hidetomo Nabeshima
Venue:
IDEAL '02 Proceedings of the Third International Conference on Intelligent Data Engineering and Automated Learning
Year:
2002

Abstract

The recognition and extraction of semantic/logical structures in HTML documents are important and difficult tasks for intelligent document processing. In this paper, we show that alignment is appropriate for recognizing the characteristic semantic/logical structures of a series of HTML documents within a framework of case-based reasoning. That is, given a series of HTML documents and a sample transformation of one HTML document into an XML format, alignment can identify the semantic/logical structures in the remaining HTML documents of the series by matching the text-block sequence of each remaining document against that of the sample transformation. Several important properties of texts, such as continuity and sequentiality, can naturally be exploited by alignment. Alignment can significantly improve the case-based transformation method, which transforms a spatial/temporal series of HTML documents into machine-readable XML formats. Through experimental evaluations, we show that the case-based method with alignment achieves highly accurate transformation of HTML documents into XML.
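The matching step described in the abstract can be illustrated with a small sketch. The abstract does not specify the alignment algorithm or the scoring function, so everything below is an assumption made for illustration: a Needleman-Wunsch style global alignment, a word-level Jaccard similarity between text blocks, a gap penalty of -0.5, and the hypothetical function names `similarity`, `align`, and `transfer_tags`. It is not the authors' implementation.

```python
GAP = -0.5  # assumed gap penalty; the paper's actual value is not given here


def similarity(a: str, b: str) -> float:
    """Match score in [-1, 1] from word-level Jaccard overlap (an assumption)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    jaccard = len(wa & wb) / len(union) if union else 0.0
    return 2.0 * jaccard - 1.0


def align(sample, target):
    """Globally align two text-block sequences.

    Returns (sample_index, target_index) pairs in order; None marks a gap.
    """
    n, m = len(sample), len(target)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + GAP
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + similarity(sample[i - 1], target[j - 1]),
                score[i - 1][j] + GAP,  # sample block left unmatched
                score[i][j - 1] + GAP,  # target block left unmatched
            )
    # Trace back from the bottom-right cell to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + similarity(sample[i - 1], target[j - 1])
        ):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + GAP):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


def transfer_tags(sample_blocks, sample_tags, target_blocks):
    """Copy the sample's XML tag names onto the aligned target blocks."""
    return [
        (sample_tags[i], target_blocks[j])
        for i, j in align(sample_blocks, target_blocks)
        if i is not None and j is not None
    ]


# Invented example data: one tagged sample page and one untagged page.
sample_blocks = [
    "Weather report for Tokyo",
    "Date: 12 August 2002",
    "Forecast: sunny with light winds",
]
sample_tags = ["title", "date", "forecast"]
target_blocks = [
    "Weather report for Osaka",
    "Date: 13 August 2002",
    "Advertisement",
    "Forecast: cloudy with heavy rain",
]
tagged = transfer_tags(sample_blocks, sample_tags, target_blocks)
```

In this invented example the "Advertisement" block has no counterpart in the sample, so the alignment skips it with a gap and the remaining blocks keep their order. This is one way the sequentiality property mentioned in the abstract could be exploited.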