Repetition-based web page segmentation by detecting tag patterns for small-screen devices

  • Authors:
  • J. Kang; J. Yang; J. Choi

  • Affiliations:
  • Department of Computer Science and Engineering, Hanyang University;-;-

  • Venue:
  • IEEE Transactions on Consumer Electronics
  • Year:
  • 2010

Quantified Score

Hi-index 0.43

Abstract

Segmenting a Web page into logical blocks is an important preprocessing step for recognizing the informative content blocks in a page, which in turn enables efficient information extraction and convenient display on devices with small-sized screens. Previous methods for Web page segmentation are not flexible in a dynamic Web environment because they rely largely on heuristic rules derived from the structural tags and visual information inherent in a page. To resolve this problem, this paper proposes a new method of Web page segmentation that recognizes repetitive tag patterns, called key patterns, in the DOM tree structure of a page. We report on the Repetition-based Page Segmentation (REPS) algorithm, which detects key patterns in a page and generates virtual nodes to correctly segment nested blocks. A series of experiments performed on real Web sites showed that REPS greatly improves the correctness of Web page segmentation.
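
The general idea of grouping siblings by a repeated tag pattern can be illustrated with a minimal Python sketch. This is not the REPS algorithm: the abstract gives no details of how key patterns are scored or how nested blocks are handled, so everything below is an assumption made for illustration. The hypothetical helpers `find_key_pattern` and `group_into_blocks` work on a flat list of sibling tag names, pick the consecutively repeated tag sequence that covers the most siblings, and wrap each repetition in its own list, loosely playing the role of a virtual node.

```python
from typing import List, Optional, Tuple

Pattern = Tuple[Tuple[str, ...], int, int]  # (tag sequence, start index, repeat count)


def find_key_pattern(tags: List[str], min_repeats: int = 2) -> Optional[Pattern]:
    """Return the consecutively repeated tag sequence covering the most tags.

    Illustrative heuristic only (an assumption, not the paper's scoring):
    ties are broken in favour of the shorter pattern and the earlier start.
    """
    best: Optional[Pattern] = None
    best_cover = 0
    n = len(tags)
    for size in range(1, n // min_repeats + 1):
        for start in range(0, n - size * min_repeats + 1):
            pattern = tuple(tags[start:start + size])
            repeats = 1
            pos = start + size
            # Count how many times the pattern repeats back to back.
            while tuple(tags[pos:pos + size]) == pattern:
                repeats += 1
                pos += size
            cover = repeats * size
            if repeats >= min_repeats and cover > best_cover:
                best, best_cover = (pattern, start, repeats), cover
    return best


def group_into_blocks(tags: List[str], min_repeats: int = 2) -> List[List[str]]:
    """Group each repetition of the key pattern into one block.

    Tags outside the repeated region are kept as single-tag blocks.
    """
    found = find_key_pattern(tags, min_repeats)
    if found is None:
        return [[t] for t in tags]
    pattern, start, repeats = found
    size = len(pattern)
    blocks = [[t] for t in tags[:start]]
    for i in range(repeats):
        lo = start + i * size
        blocks.append(tags[lo:lo + size])
    blocks.extend([t] for t in tags[start + repeats * size:])
    return blocks


if __name__ == "__main__":
    # Hypothetical children of one DOM node: a heading, three news items, a footer.
    siblings = ["h2", "a", "img", "p", "a", "img", "p", "a", "img", "p", "div"]
    print(find_key_pattern(siblings))
    print(group_into_blocks(siblings))
```

For the sample sibling list, the repeated sequence `a, img, p` is selected and each of its three occurrences becomes one block, while the heading and the trailing `div` stay on their own. A real segmenter would have to apply such a step recursively over the DOM tree and tolerate small variations between repetitions, which this sketch does not attempt.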