Web page DOM node characterization and its application to page segmentation

  • Authors:
  • Gujjar Vineel

  • Affiliations:
  • Computing and Decision Sciences Lab, GE Research, India

  • Venue:
  • IMSAA'09: Proceedings of the 3rd IEEE International Conference on Internet Multimedia Services Architecture and Applications
  • Year:
  • 2009

Abstract

Web pages are generally organized in terms of visually distinct segments, such as Navigation bars, Advertisement banners, Headers, Portlets and Widgets. Despite this apparently structured layout, web pages are considered a source of unstructured data from an information extraction point of view. Hence, as a step towards interpreting the organization of web data, web page segmentation attempts to identify cohesive regions of a page. In this paper, we present a novel DOM tree mining approach for page segmentation. We first characterize the nodes of the DOM tree based on their Content Size and Entropy. While the Content Size of a node indicates the amount of textual content contributed by its subtree, Entropy measures the strength of local "patterns" exhibited therein. In other words, a node manifesting highly repetitive patterns begets a high Entropy as per our formulation. Based on this characterization of DOM nodes, we then develop an unsupervised algorithm to automatically identify the segments of a given web page.
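
The two node-level features described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual formulas, so everything below is an assumption: `content_size` counts non-whitespace text characters in a subtree, and `repetition_score` is a hypothetical stand-in for the paper's Entropy measure, defined here as one minus the normalized Shannon entropy of a node's child tags, so that (as in the abstract) more repetitive structure gives a higher score. The sample markup is also invented for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def content_size(node):
    """Count non-whitespace text characters contributed by the node's subtree."""
    return sum(len("".join(text.split())) for text in node.itertext())


def repetition_score(node):
    """Hypothetical proxy for the paper's Entropy (NOT the paper's formula).

    Computes 1 - normalized Shannon entropy of the child tag distribution:
    identical children score 1.0, all-distinct children score about 0.0.
    """
    tags = [child.tag for child in node]
    if len(tags) < 2:
        return 0.0  # too few children to exhibit a pattern
    total = len(tags)
    counts = Counter(tags)
    shannon = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 1.0 - shannon / math.log2(total)


# Invented, well-formed sample page: a repetitive list and a mixed block.
page = ET.fromstring(
    "<body>"
    "<ul><li>Home</li><li>News</li><li>About</li></ul>"
    "<div><h1>Title</h1><p>Some text</p><span>x</span></div>"
    "</body>"
)

for node in page.iter():
    print(node.tag, content_size(node), round(repetition_score(node), 2))
```

In this sketch the `ul` node, whose children all share one tag, scores as highly repetitive, while the mixed `div` does not; a segmentation step could then use both features to decide which subtrees form cohesive regions, though how the paper combines them is not stated in the abstract.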