Web page sectioning using regex-based template

  • Authors:
  • Rupesh R. Mehta; Amit Madaan

  • Affiliations:
  • Yahoo! R&D, Bangalore, India; Yahoo! R&D, Bangalore, India

  • Venue:
  • Proceedings of the 17th international conference on World Wide Web
  • Year:
  • 2008

Abstract

This work aims to provide a novel, site-specific algorithm for web page segmentation and section importance detection that leverages structural, content, and visual information. Structural and content information is captured by a template, a generalized regular expression learned over a set of pages. The template, combined with visual information, yields high sectioning accuracy. Experimental results demonstrate the effectiveness of the approach.
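
To make the idea of a template as a generalized regular expression more concrete, the following is a minimal toy sketch, not the authors' algorithm. It assumes each page can be reduced to its sequence of opening tag names, and that runs of a repeated tag generalize to "one or more" of that tag. The function names (`tag_sequence`, `generalize`, `learn_template`, `matches`) are hypothetical, and the sketch ignores content and visual information as well as section importance.

```python
import re
from itertools import groupby


def tag_sequence(html: str) -> list[str]:
    """Return the opening-tag names of an HTML fragment, in document order."""
    return [name.lower() for name in re.findall(r"<([a-zA-Z][a-zA-Z0-9]*)\b", html)]


def generalize(seq: list[str]) -> list[tuple[str, bool]]:
    """Collapse each run of identical tags into (tag, was_repeated)."""
    return [(tag, len(list(run)) > 1) for tag, run in groupby(seq)]


def learn_template(pages: list[str]) -> re.Pattern:
    """Build a regex over tag sequences (assumption: a toy stand-in for a template).

    All pages must share the same run structure; a tag that repeats in any
    page becomes "one or more" of that tag in the template.
    """
    generalized = [generalize(tag_sequence(p)) for p in pages]
    shapes = [[tag for tag, _ in g] for g in generalized]
    if any(shape != shapes[0] for shape in shapes):
        raise ValueError("pages differ in structure; this toy learner cannot merge them")
    parts = []
    for i, tag in enumerate(shapes[0]):
        repeated = any(g[i][1] for g in generalized)
        # Each token is followed by a space so tag names cannot run together.
        parts.append(f"(?:{tag} )+" if repeated else f"{tag} ")
    return re.compile("".join(parts))


def matches(template: re.Pattern, html: str) -> bool:
    """Check whether a page's tag sequence fits the learned template."""
    encoded = "".join(tag + " " for tag in tag_sequence(html))
    return template.fullmatch(encoded) is not None
```

In this sketch, two pages from the same hypothetical site that differ only in the number of list items would yield one template, and a further page with yet another number of items would still match it. A real system along the lines of the abstract would additionally need to handle optional or alternative substructures and to combine the template with visual cues.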