Extracting Structures of HTML Documents

  • Authors: Seung-Jin Lim; Yiu-Kai Ng

  • Venue: ICOIN '98 Proceedings of the 13th International Conference on Information Networking
  • Year: 1998

Abstract

Information on the Web, a conglomeration of heterogeneous data such as text, images, and audio clips, is often accessed through documents written according to the HTML specification. By that specification, HTML documents are semistructured in nature. We propose a high-level stack machine (HSM) that accesses an HTML document through its URL and constructs a semistructured data graph (SDG) of the document. The SDG of an HTML document H precisely captures the structure of the semistructured data embedded in H, based on the dependency relationships among the data objects in H. HSM is configurable to accommodate a user's interest in which HTML elements of H are considered during the construction of the SDG of H.
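
The abstract does not give the internals of HSM, so the following is only a minimal sketch of the general idea: a stack tracks the currently open elements, and each element of interest becomes a node attached to the node on top of the stack. The names `Node`, `StackGraphBuilder`, `build_graph`, and `tags_of_interest` are hypothetical and not taken from the paper. The nesting-based parent/child edges stand in for the paper's dependency relationship, which may well be defined differently. Fetching the document by URL (for example with `urllib.request.urlopen`) is left out so the sketch runs offline.

```python
from html.parser import HTMLParser

# Void elements never have an end tag, so they must not remain on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One data object in the illustrative graph (hypothetical structure)."""

    def __init__(self, label):
        self.label = label
        self.children = []  # edges to nested (dependent) objects
        self.text = []      # text fragments found directly under this node

    def to_tuple(self):
        """Return a nested (label, text, children) tuple for easy inspection."""
        return (self.label, "".join(self.text).strip(),
                [child.to_tuple() for child in self.children])


class StackGraphBuilder(HTMLParser):
    """Builds a tree-shaped graph using an explicit stack of open elements."""

    def __init__(self, tags_of_interest):
        super().__init__()
        # Assumption: "configurable" is modeled as a set of tag names to keep.
        self.tags_of_interest = set(tags_of_interest)
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags_of_interest:
            return  # ignored elements contribute no node
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in self.tags_of_interest:
            return
        # Pop back to the nearest matching open element; this tolerates
        # unclosed inner tags. The root at index 0 is never popped.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text is attached to the innermost open element of interest.
        if data.strip():
            self.stack[-1].text.append(data)


def build_graph(html, tags_of_interest):
    """Parse an HTML string and return the root Node of the sketch graph."""
    builder = StackGraphBuilder(tags_of_interest)
    builder.feed(html)
    builder.close()
    return builder.root


sample = "<html><body><h1>Title</h1><ul><li>a</li><li>b</li></ul></body></html>"
graph = build_graph(sample, {"h1", "ul", "li"})
```

Here `graph.to_tuple()` yields a root labeled `document` with two children: an `h1` node holding the text `Title`, and a `ul` node with two `li` children. Changing `tags_of_interest` changes which elements appear in the graph, which is the sketch's stand-in for the configurability the abstract describes.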