Template extraction from candidate template set generation: a structure and content approach

  • Authors:
  • Hang Su; Qiaozhu Mei

  • Affiliations:
  • Vanderbilt University, Nashville, TN; University of Illinois at Urbana-Champaign, Urbana, IL

  • Venue:
  • Proceedings of the 43rd annual Southeast regional conference - Volume 2
  • Year:
  • 2005

Abstract

This paper introduces a new approach to webpage template extraction. Unlike traditional methods, which consider only content information, it takes both structural and content similarity into account. It uses the natural table structure of a page as the content unit, instead of text blocks or pagelets. The paper gives formal definitions of templates and related concepts. It introduces a new concept, the candidate template, which is an intermediate-level abstraction of table structure. A candidate template covers only the most informative tables and abstracts a large set of pages with similar structures. Template extraction is then framed as three subproblems centered on the candidate template set. Introducing the candidate template set addresses the accuracy and efficiency problems of traditional approaches. The paper also introduces a new model for structural similarity, and for table informativeness based on six heuristics.
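
The abstract does not spell out the structural similarity model, so the following is only an illustrative sketch of the general idea of comparing pages by their table structure rather than by their text. Everything in it is an assumption made for illustration and is not taken from the paper: the (depth, rows, cells) signature, the Jaccard overlap used as the similarity score, and the names TableShapeParser, table_signatures and structural_similarity.

```python
from html.parser import HTMLParser


class TableShapeParser(HTMLParser):
    """Collects a (depth, rows, cells) signature for every <table> in a page.

    The signature is a hypothetical stand-in for "table structure";
    it is not the representation defined in the paper.
    """

    def __init__(self):
        super().__init__()
        self.stack = []       # one [rows, cells] counter per currently open table
        self.signatures = []  # finished signatures, in closing order

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.stack.append([0, 0])
        elif tag == "tr" and self.stack:
            self.stack[-1][0] += 1
        elif tag in ("td", "th") and self.stack:
            self.stack[-1][1] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.stack:
            rows, cells = self.stack.pop()
            depth = len(self.stack)  # 0 for a top-level table
            self.signatures.append((depth, rows, cells))


def table_signatures(html):
    """Return the set of table signatures found in an HTML string."""
    parser = TableShapeParser()
    parser.feed(html)
    parser.close()
    return set(parser.signatures)


def structural_similarity(html_a, html_b):
    """Jaccard overlap of table signatures (an assumed, simplified measure)."""
    a, b = table_signatures(html_a), table_signatures(html_b)
    if not a and not b:
        return 1.0  # two pages without tables are trivially alike
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    page_a = "<table><tr><td>News</td><td>Sports</td></tr></table>"
    page_b = "<table><tr><td>Other</td><td>Text</td></tr></table>"
    # Same table shape, different text: the score ignores content entirely.
    print(structural_similarity(page_a, page_b))
```

Because the score looks only at table shapes, two pages generated from the same layout compare as similar even when their text differs completely. A method in the spirit of the abstract would combine such a structural score with a content-based one.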