Page-level template detection via isotonic smoothing

  • Authors:
  • Deepayan Chakrabarti; Ravi Kumar; Kunal Punera

  • Affiliations:
  • Yahoo! Research, Sunnyvale, CA; Yahoo! Research, Sunnyvale, CA; University of Texas at Austin, Austin, TX

  • Venue:
  • Proceedings of the 16th international conference on World Wide Web
  • Year:
  • 2007

Abstract

We develop a novel framework for the page-level template detection problem. Our framework is built on two main ideas. The first is the automatic generation of training data for a classifier that, given a page, assigns a templateness score to every DOM node of the page. The second is the global smoothing of these per-node classifier scores by solving a regularized isotonic regression problem; the latter follows from a simple yet powerful abstraction of templateness on a page. Our extensive experiments on human-labeled test data show that our approach detects templates effectively.
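
To give a feel for the smoothing idea, the following is a minimal sketch of plain isotonic regression on a single chain of scores, using the classic pool-adjacent-violators procedure. It is not the formulation from the paper: the abstract describes a regularized problem over the whole DOM tree, while this sketch is unregularized and handles only one sequence, such as the scores along one hypothetical root-to-leaf path. The function name `isotonic_fit`, the sample scores, and the choice of a non-decreasing direction are all assumptions made for illustration.

```python
def isotonic_fit(scores):
    """Least-squares non-decreasing fit via pool-adjacent-violators.

    Illustrative only: a chain-shaped, unregularized simplification,
    not the tree-based regularized problem described in the abstract.
    """
    blocks = []  # each block holds [sum of scores, number of scores]
    for s in scores:
        blocks.append([float(s), 1])
        # Merge backwards while the previous block's mean exceeds
        # the mean of the block just added (an ordering violation).
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    # Expand each pooled block back into one fitted value per input.
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Hypothetical per-node classifier scores along one path (assumed data).
raw_scores = [0.25, 0.75, 0.5, 1.0]
smoothed = isotonic_fit(raw_scores)
print(smoothed)
```

Here the second and third scores violate the assumed ordering, so they are pooled and both replaced by their mean, while the scores that already respect the ordering are left unchanged.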