On Finding Templates on Web Collections

Authors:
Karane Vieira; André Luiz Costa Carvalho; Klessius Berlt; Edleno S. Moura; Altigran S. Silva; Juliana Freire
Affiliations:
Department of Computer Science, Federal University of Amazonas, Manaus, Brazil; Department of Computer Science, Federal University of Amazonas, Manaus, Brazil; Department of Computer Science, Federal University of Amazonas, Manaus, Brazil; Department of Computer Science, Federal University of Amazonas, Manaus, Brazil; Department of Computer Science, Federal University of Amazonas, Manaus, Brazil; School of Computing, University of Utah, Salt Lake City, USA
Venue:
World Wide Web
Year:
2009



Abstract

Templates are pieces of HTML code common to a set of Web pages, usually adopted by content providers to enhance the uniformity of layout and navigation of their Web sites. They are usually generated using authoring/publishing tools or by programs that build HTML pages to publish content from a database. In spite of their usefulness, the content of templates can negatively affect the quality of results produced by systems that automatically process information available in Web sites, such as search engines, clustering programs, and automatic categorization programs. Further, the information available in templates is redundant, and thus processing and storing such information just once for a set of pages may save computational resources. In this paper, we present and evaluate methods for detecting templates in a scenario where multiple templates can be found in a collection of Web pages. Most previous work has studied template detection algorithms in a scenario where the collection has just a single template. The scenario with multiple templates is more realistic and, as discussed here, it raises important questions that may require extensions and adjustments to previously proposed template detection algorithms. We show how to apply and evaluate two template detection algorithms in this scenario, creating solutions for detecting multiple templates. The methods studied partition the input collection into clusters that contain common HTML paths and share a high number of HTML nodes, and then apply a single-template detection procedure over each cluster. We also propose a new algorithm for single-template detection based on a restricted form of bottom-up tree mapping that requires only a small set of pages to correctly identify a template and has worst-case linear complexity. Our experimental results over a representative set of Web pages show that our approach is efficient and scalable while obtaining accurate results.
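
The two-stage idea in the abstract (cluster pages by shared HTML paths, then detect one template per cluster) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not specify the clustering procedure, and the paper's single-template step is a restricted bottom-up tree mapping, which is replaced here by a much simpler intersection of tag paths. The function names (`tag_paths`, `cluster_pages`, `common_paths`), the Jaccard similarity measure, the greedy clustering strategy, and the `0.6` threshold are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-node tag paths, e.g. 'html/body/div'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closes.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in one HTML document."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the paper)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.6):
    """Greedy single-pass clustering (hypothetical strategy).

    A page joins the first cluster whose representative path set (taken
    from the cluster's first page) is at least `threshold` similar.
    """
    clusters = []  # list of (representative_paths, member_indices)
    for index, html in enumerate(pages):
        paths = tag_paths(html)
        for representative, members in clusters:
            if jaccard(representative, paths) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((paths, [index]))
    return [members for _, members in clusters]


def common_paths(pages, members):
    """Template candidate for a cluster: paths present in every member page."""
    path_sets = [tag_paths(pages[i]) for i in members]
    return frozenset.intersection(*path_sets)
```

Calling `cluster_pages(pages)` on a list of HTML strings returns groups of page indices, and `common_paths(pages, group)` then yields the tag paths shared by every page in a group. A path-set intersection ignores node content and sibling order, so it only hints at where a template might be; a tree-mapping method such as the one the abstract describes compares actual subtrees.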