Automatic Discovery of Semantic Structures in HTML Documents

  • Authors:
  • Saikat Mukherjee; Guizhen Yang; Wenfang Tan; I. V. Ramakrishnan

  • Venue:
  • ICDAR '03 Proceedings of the Seventh International Conference on Document Analysis and Recognition - Volume 1
  • Year:
  • 2003

Abstract

Template-driven HTML documents possess an implicit, fixed schema denoting concepts and their relationships in a hierarchical fashion. Discovering this schema remains a relatively unexplored problem. By exploiting a key observation that semantically related items in HTML documents exhibit spatial locality, we develop an algorithm for automatically partitioning them into tree-like semantic structures which expose the implicit schema.
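
The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the spatial-locality idea, not the authors' method. It assumes that "spatially local, semantically related" items can be approximated by adjacent sibling subtrees sharing the same tag structure, and it groups such runs into a nested tree. All names (`Node`, `TreeBuilder`, `signature`, `partition`, `parse`) are hypothetical and introduced here for illustration.

```python
from html.parser import HTMLParser
from itertools import groupby


class Node:
    """A bare-bones DOM node: tag name, direct text, children, parent."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in self.VOID:
            self.cur = node  # descend into the newly opened element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        n = self.cur
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent

    def handle_data(self, data):
        self.cur.text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def signature(node):
    """Structural fingerprint of a subtree: tag names only, text ignored."""
    return f"{node.tag}({','.join(signature(c) for c in node.children)})"


def partition(node):
    """Group runs of adjacent, structurally identical siblings (assumed proxy
    for spatial locality) and recurse, yielding a tree-like structure."""
    groups = []
    for sig, run in groupby(node.children, key=signature):
        run = list(run)
        groups.append({
            "signature": sig,
            "size": len(run),
            "items": [partition(child) for child in run],
        })
    return {"tag": node.tag, "text": node.text, "groups": groups}


if __name__ == "__main__":
    page = "<ul><li><b>A</b></li><li><b>B</b></li><li><i>C</i></li></ul>"
    ul = partition(parse(page))["groups"][0]["items"][0]
    for group in ul["groups"]:
        print(group["signature"], group["size"])
```

In this toy input, the two adjacent `<li><b>…</b></li>` items fall into one group and the differently structured third item into another. A real system would likely need approximate rather than exact structural matching, which this sketch does not attempt.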