Extracting the Latent Hierarchical Structure of Web Documents

  • Authors: Michael A. El-Shayeb; Samhaa R. El-Beltagy; Ahmed Rafea

  • Affiliations: Computer Science Department, Faculty of Computers and Information, Cairo University, Giza, Egypt 12613; Computer Science Department, Faculty of Computers and Information, Cairo University, Giza, Egypt 12613; Computer Science Department, American University in Cairo, Cairo, Egypt

  • Venue: Advanced Internet Based Systems and Applications
  • Year: 2009

Abstract

The hierarchical structure of a document plays an important role in understanding the relationships between its contents. However, this structure is not always explicitly represented in web documents through the available HTML hierarchical tags. Headings are nonetheless usually differentiated from 'normal' text in terms of presentation, which provides an implicit structure discernible by a human reader. An important pre-processing step for applications that need to operate at the hierarchical level is therefore to extract this implicitly represented hierarchical structure. This paper presents an algorithm for heading detection and heading level detection that makes use of various visual presentation cues. Results of evaluating this algorithm are also reported.
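
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of detecting headings and their levels from visual presentation. It is not the authors' method. It assumes text blocks have already been extracted from a page together with two hypothetical visual features (`font_size` and `bold`); the `Block` type, the `detect_headings` function, and the `max_heading_words` threshold are all illustrative names and values chosen for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A rendered text block with hypothetical visual features."""
    text: str
    font_size: float  # assumed to be in points
    bold: bool = False


def detect_headings(blocks, max_heading_words=12):
    """Return (level, text) pairs for blocks that look like headings.

    Illustrative heuristic only (an assumption, not the paper's algorithm):
    the body font size is the size that covers the most words; a short block
    is treated as a heading if it is larger than the body size, or the same
    size but bold.
    """
    if not blocks:
        return []

    # Estimate the 'normal' text size as the size carrying the most words.
    words_per_size = Counter()
    for b in blocks:
        words_per_size[b.font_size] += len(b.text.split())
    body_size = words_per_size.most_common(1)[0][0]

    def is_heading(b):
        short = len(b.text.split()) <= max_heading_words
        prominent = b.font_size > body_size or (
            b.font_size == body_size and b.bold
        )
        return short and prominent

    candidates = [b for b in blocks if is_heading(b)]

    # Rank distinct visual styles: larger sizes first, and bold before
    # non-bold at equal size. The rank becomes the heading level.
    styles = sorted({(b.font_size, b.bold) for b in candidates}, reverse=True)
    level_of = {style: i + 1 for i, style in enumerate(styles)}
    return [(level_of[(b.font_size, b.bold)], b.text) for b in candidates]


if __name__ == "__main__":
    page = [
        Block("Introduction", 18, True),
        Block("This paragraph is ordinary body text with quite a few words.", 12),
        Block("Background", 14, True),
        Block("More body text follows here and continues for a while.", 12),
        Block("Note", 12, True),
    ]
    for level, text in detect_headings(page):
        print(level, text)
```

In this sketch the level is derived purely from the relative ranking of styles found on one page, so the same style can map to different levels on different pages. A real system would likely need further cues (for example colour, alignment, or surrounding whitespace), which are outside what the abstract specifies.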