A Heuristic Approach for Converting HTML Documents to XML Documents

  • Authors:
  • Seung Jin Lim;Yiu-Kai Ng

  • Venue:
  • CL '00 Proceedings of the First International Conference on Computational Logic
  • Year:
  • 2000

Abstract

XML is rapidly emerging, yet numerous HTML documents still exist on the Web. In this paper, we present a heuristic approach for converting HTML documents to XML documents. During conversion, we drop all HTML elements from the resulting XML document, since these elements are designed exclusively for displaying data, but retain the character data of each element along with the implicit hierarchy among the data. The proposed approach extracts the data hierarchy of an HTML document as closely as possible with no human intervention. It can be adopted to construct the data hierarchy of an HTML document and to collect data from HTML documents into an XML repository.
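
The general idea of discarding presentation markup while keeping character data and its implied hierarchy can be illustrated with a minimal sketch. This is not the authors' heuristic, which the abstract does not specify. It assumes, purely for illustration, that heading levels (`h1` to `h6`) signal the hierarchy and that every other run of text is a leaf. The names `HierarchyExtractor`, `html_to_xml`, and the output element names `document`, `section`, and `data` are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Assumption for this sketch: heading tags are the only hierarchy cue.
HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HierarchyExtractor(HTMLParser):
    """Drops HTML tags, keeps text, and nests text under headings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [(0, self.root)]  # (heading level, XML node)
        self.heading_level = None      # set while inside a heading tag
        self.buffer = []               # pending character data

    def _flush(self):
        # Join buffered text and collapse runs of whitespace.
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        return text

    def _emit_data(self):
        if self.heading_level is not None:
            return  # still collecting the heading's own text
        text = self._flush()
        if text:
            ET.SubElement(self.stack[-1][1], "data").text = text

    def handle_starttag(self, tag, attrs):
        self._emit_data()
        if tag in HEADINGS:
            self.heading_level = HEADINGS[tag]

    def handle_endtag(self, tag):
        if tag in HEADINGS and self.heading_level is not None:
            title = self._flush()
            level = self.heading_level
            self.heading_level = None
            # Close sections at the same or a deeper level.
            while self.stack[-1][0] >= level:
                self.stack.pop()
            node = ET.SubElement(self.stack[-1][1], "section", title=title)
            self.stack.append((level, node))
        else:
            self._emit_data()

    def handle_data(self, data):
        self.buffer.append(data)

    def close(self):
        super().close()
        self.heading_level = None  # treat an unclosed heading as plain text
        self._emit_data()


def html_to_xml(html: str) -> str:
    parser = HierarchyExtractor()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    sample = ("<h1>Books</h1><p>Intro</p><h2>Fiction</h2>"
              "<ul><li>Dune</li><li>Emma</li></ul>")
    print(html_to_xml(sample))
```

A real converter would need further cues, such as lists, tables, and font changes, and would have to skip `script` and `style` content, which this sketch treats as ordinary text.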