A language modeling approach to information retrieval
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
An Evaluation of Statistical Approaches to Text Categorization
Information Retrieval
Template detection via data mining and its applications
Proceedings of the 11th international conference on World Wide Web
COMMIX: towards effective web information extraction, integration and query answering
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
Web classification using support vector machine
Proceedings of the 4th international workshop on Web information and data management
A Study of Approaches to Hypertext Categorization
Journal of Intelligent Information Systems
Exploiting hierarchical domain structure to compute similarity
ACM Transactions on Information Systems (TOIS)
Structured databases on the web: observations and implications
ACM SIGMOD Record
Hi-index | 0.00 |
This paper introduces a new approach of webpage template extraction. Unlike traditional methods which concern only content information, this paper considers both structure and content similarity. It uses natural table structure as content units instead of text blocks or pagelets. This paper novelly and formally defines the templates and other concepts. It introduces a new concept, candidate template, which is an intermediate level of abstract table structure. A candidate template only covers the most informative tables, and abstracts a large page set with similar structures. This paper proposes a novel approach of template extraction by solving three sub problems surrounding candidate template set. The involving of candidate template set solves the accuracy and efficiency problems of traditional approaches. This paper also introduces a new model for structural similarity, and for table informativeness based on six heuristics.