Capturing Semantics in HTML Documents

  • Authors:
  • Mengchi Liu

  • Venue:
  • DEXA '02 Proceedings of the 13th International Conference on Database and Expert Systems Applications
  • Year:
  • 2002

Abstract

Most documents available on the web conform to the HTML specification. They are intended to be read by humans through a web browser and are therefore constructed following some common conventions. Based on these conventions, the Conceptual Model for HTML was recently proposed to automatically capture the hierarchical structure within web documents. However, certain key semantic information about the contents of the documents, which is obvious to a human reader, is often omitted. As a result, web data processing, manipulation, and integration are still quite difficult. In this paper, we discuss how to extend the Conceptual Model for HTML to capture the intended semantics of HTML documents. We show that, with the newly introduced constructs, an Intelligent Wrapper, and limited human interaction, semantics can be transferred from humans into the Extended Conceptual Model so that further meaningful processing, manipulation, and integration of web documents become possible.