Understanding web documents: finding pagelets for transformation using structural patterns

Authors:
Reza Ferrydiansyah;Bambang Parmanto
Affiliations:
Health Information Management, School of Health and Rehabilitation Sciences, University of Pittsburgh, 6025 Forbes Tower, HIM, Pittsburgh, PA 15260, USA;Health Information Management, School of Health and Rehabilitation Sciences, University of Pittsburgh, 6026 Forbes Tower, HIM, Pittsburgh, PA 15260, USA
Venue:
International Journal of Web Engineering and Technology
Year:
2008

Abstract

Understanding a web document and the sections within it is important for web transformation and for information retrieval from web pages. Detecting pagelets, small features located inside a web page, in order to understand a document's structure is a difficult problem. Current work on pagelet detection focuses only on locating a pagelet, without regard to its functionality. We describe a method that detects both the location and the functionality of pagelets using HTML element patterns. For each pagelet type, an HTML element pattern is created and matched against a web page. Sections of the web page that match the pattern are marked as pagelet candidates. We test this technique on multiple popular web pages from the news and e-commerce genres and find that the method adequately recalls various pagelets from these pages.
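
The idea of matching one element pattern per pagelet type against a page can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the pattern format, so the pattern shape (container tag, descendant tag, minimum count), the two pagelet types, the names `PATTERNS` and `find_candidates`, and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """A bare element node; text and attributes are ignored in this sketch."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a minimal element tree from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


# Hypothetical patterns (assumption, not from the paper):
# pagelet type -> (container tag, descendant tag, minimum number of descendants)
PATTERNS = {
    "navigation": ("ul", "a", 3),
    "search": ("form", "input", 1),
}


def count_descendants(node, tag):
    """Counts elements named `tag` anywhere below `node`."""
    return sum((c.tag == tag) + count_descendants(c, tag) for c in node.children)


def find_candidates(html):
    """Returns (pagelet type, path) pairs for every section matching a pattern."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = []

    def visit(node, path):
        for index, child in enumerate(node.children):
            # The index is the position among all element children of the parent.
            child_path = f"{path}/{child.tag}[{index}]"
            for pagelet_type, (container, inner, minimum) in PATTERNS.items():
                if child.tag == container and count_descendants(child, inner) >= minimum:
                    candidates.append((pagelet_type, child_path))
            visit(child, child_path)

    visit(builder.root, "")
    return candidates


SAMPLE = """
<body>
  <ul><li><a href="/">Home</a></li><li><a href="/world">World</a></li><li><a href="/sport">Sport</a></li></ul>
  <form action="/search"><input name="q"><input type="submit"></form>
  <ul><li>Just text</li></ul>
</body>
"""

for pagelet_type, path in find_candidates(SAMPLE):
    print(pagelet_type, path)
```

Each candidate carries both a location (the path) and a functional label (the pagelet type), which mirrors the abstract's distinction between finding where a pagelet is and what it is for. The second list in the sample is not reported because it contains no links and so fails the assumed navigation pattern.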