Understanding a web document and its sections is important for web transformation and for information retrieval from web pages. Detecting pagelets, small features located inside a web page, in order to understand the document's structure is a difficult problem. Current work on pagelet detection focuses only on locating a pagelet, without regard to its functionality. We describe a method that detects both the location and the functionality of pagelets using HTML element patterns. For each pagelet type, an HTML element pattern is created and matched against a web page. Sections of the web page that match the pattern are marked as pagelet candidates. We test this technique on multiple popular web pages from the news and e-commerce genres and find that the method adequately recalls various pagelets from these pages.
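To make the idea of matching per-type HTML element patterns more concrete, the following is a minimal sketch in Python. It is not the method evaluated in this work: the pattern format (a container tag, a required descendant tag, and a minimum count), the two pagelet types in `PATTERNS`, and the names `TreeBuilder` and `find_pagelet_candidates` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a small, non-exhaustive subset).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One HTML element in a simple parent/children tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into elements that can hold children

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


# Hypothetical patterns: pagelet type -> (container, required descendant, minimum count).
PATTERNS = {
    "navigation": ("ul", "a", 3),
    "search": ("form", "input", 1),
}


def count_descendants(node, tag):
    """Counts all descendants of `node` whose tag equals `tag`."""
    return sum((c.tag == tag) + count_descendants(c, tag) for c in node.children)


def element_path(node):
    """Returns a slash-separated tag path from the document root to `node`."""
    parts = []
    while node.parent is not None:
        parts.append(node.tag)
        node = node.parent
    return "/".join(reversed(parts))


def find_pagelet_candidates(html):
    """Returns (pagelet_type, element_path) for every element matching a pattern."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = []

    def visit(node):
        for name, (container, required, minimum) in PATTERNS.items():
            if node.tag == container and count_descendants(node, required) >= minimum:
                candidates.append((name, element_path(node)))
        for child in node.children:
            visit(child)

    visit(builder.root)
    return candidates


SAMPLE = """
<html><body>
<ul><li><a href="/">Home</a></li><li><a href="/w">World</a></li><li><a href="/s">Sports</a></li></ul>
<form action="/search"><input type="text" name="q"><input type="submit"></form>
<ul><li>Plain item</li></ul>
</body></html>
"""

candidates = find_pagelet_candidates(SAMPLE)
```

In this sketch each candidate carries both a location (the element path) and a functionality label (the pattern name), mirroring the two things the method aims to detect. On `SAMPLE`, the first list and the form are marked as candidates, while the second list is not because it contains no links. Realistic patterns would likely need attribute checks and more structural constraints than a single descendant count.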