Template extraction from candidate template set generation: a structure and content approach

Authors:
Hang Su; Qiaozhu Mei
Affiliations:
Vanderbilt University, Nashville, TN; University of Illinois at Urbana-Champaign, Urbana, IL
Venue:
Proceedings of the 43rd annual Southeast regional conference - Volume 2
Year:
2005

Abstract

This paper introduces a new approach to webpage template extraction. Unlike traditional methods, which consider only content information, it takes both structural and content similarity into account. It uses the natural table structure of a page as the content unit instead of text blocks or pagelets. The paper gives formal definitions of templates and related concepts. It introduces a new concept, the candidate template, which is an intermediate-level abstraction of table structure. A candidate template covers only the most informative tables and abstracts a large set of pages with similar structures. Template extraction is then approached by solving three subproblems centered on the candidate template set. Introducing the candidate template set addresses the accuracy and efficiency problems of traditional approaches. The paper also introduces a new model for structural similarity and a model for table informativeness based on six heuristics.