Understanding web documents: finding pagelets for transformation using structural patterns

  • Authors:
  • Reza Ferrydiansyah;Bambang Parmanto

  • Affiliations:
  • Health Information Management, School of Health and Rehabilitation Sciences, University of Pittsburgh, 6025 Forbes Tower, HIM, Pittsburgh, PA 15260, USA;Health Information Management, School of Health and Rehabilitation Sciences, University of Pittsburgh, 6026 Forbes Tower, HIM, Pittsburgh, PA 15260, USA

  • Venue:
  • International Journal of Web Engineering and Technology
  • Year:
  • 2008

Abstract

Understanding a web document and the sections within it is important for web transformation and for information retrieval from web pages. Detecting pagelets, which are small features located inside a web page, in order to understand a web document's structure is a difficult problem. Current work on pagelet detection focuses only on locating pagelets, without regard to their functionality. We describe a method that detects both the location and the functionality of pagelets using HTML element patterns. For each pagelet type, an HTML element pattern is created and matched against a web page. Sections of the web page that match the pattern are marked as pagelet candidates. We test this technique on multiple popular web pages from the news and e-commerce genres. We find that this method adequately recalls various pagelets from the web page.
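
The idea of matching one HTML element pattern per pagelet type can be illustrated with a minimal sketch. The abstract does not give the actual patterns or the matching procedure, so everything here is an assumption made for illustration: the `PATTERNS` table, the pattern shape (container tag, repeated child tag, tag required inside each child, minimum count), and the names `TreeBuilder`, `matches`, and `find_pagelets` are all hypothetical. The sketch uses only Python's standard `html.parser` module and assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree (tags only, no text or attributes)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


# Hypothetical patterns, NOT taken from the paper:
# pagelet type -> (container, repeated child, tag required in each child, min count)
PATTERNS = {
    "navigation": ("ul", "li", "a", 3),
    "search": ("form", "input", None, 1),
}


def has_descendant(node, tag):
    return any(c.tag == tag or has_descendant(c, tag) for c in node.children)


def matches(node, pattern):
    container, child, required, min_count = pattern
    if node.tag != container:
        return False
    hits = [
        c for c in node.children
        if c.tag == child and (required is None or has_descendant(c, required))
    ]
    return len(hits) >= min_count


def path(node):
    """Tag path from the document root, used here as a rough location."""
    parts = []
    while node.parent is not None:
        parts.append(node.tag)
        node = node.parent
    return "/".join(reversed(parts))


def find_pagelets(html):
    """Return (pagelet type, location) pairs in document order."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    found = []
    stack = [builder.root]
    while stack:
        node = stack.pop()
        for name, pattern in PATTERNS.items():
            if matches(node, pattern):
                found.append((name, path(node)))
        stack.extend(reversed(node.children))
    return found
```

Calling `find_pagelets` on a page returns each candidate together with the pagelet type whose pattern it matched, which mirrors the abstract's point that both location and functionality are reported. A tag path does not distinguish sibling elements with the same tag, so a fuller version would need a more precise locator.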