Identifying syntactic differences between two programs
Software—Practice & Experience
On the editing distance between unordered labeled trees
Information Processing Letters
User interface directions for the Web
Communications of the ACM
The Tree-to-Tree Correction Problem
Journal of the ACM (JACM)
Enhanced topic distillation using text, markup tags, and hyperlinks
Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Template detection via data mining and its applications
Proceedings of the 11th international conference on World Wide Web
New algorithm for ordered tree-to-tree correction problem
Journal of Algorithms
An Efficient and Scalable Algorithm for Clustering XML Documents by Structure
IEEE Transactions on Knowledge and Data Engineering
Eliminating noisy information in Web pages for data mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
A bag of paths model for measuring structural similarity in Web documents
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning block importance models for web pages
Proceedings of the 13th international conference on World Wide Web
Automatic web news extraction using tree edit distance
Proceedings of the 13th international conference on World Wide Web
The volume and evolution of web page templates
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Automatic extraction of informative blocks from webpages
Proceedings of the 2005 ACM symposium on Applied computing
A methodology for clustering XML documents by structure
Information Systems
Template detection for large scale search engines
Proceedings of the 2006 ACM symposium on Applied computing
A fast and robust method for web page template detection and removal
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Vertical Navigation of Layout Adapted Web Documents
World Wide Web
Page-level template detection via isotonic smoothing
Proceedings of the 16th international conference on World Wide Web
Extracting Web Data Using Instance-Based Learning
World Wide Web
A Novelty-based Clustering Method for On-line Documents
World Wide Web
A site oriented method for segmenting web pages
Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
Assessing the effort of repairing the accessibility of web sites
ICCHP'12 Proceedings of the 13th international conference on Computers Helping People with Special Needs - Volume Part I
Cluster-based page segmentation-a fast and precise method for web page pre-processing
Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics
Locality sensitive hashing for scalable structural classification and clustering of web documents
Proceedings of the 22nd ACM international conference on Conference on information & knowledge management
Similarity-based web browser optimization
Proceedings of the 23rd international conference on World wide web
Hi-index | 0.00 |
Templates are pieces of HTML code common to a set of web pages usually adopted by content providers to enhance the uniformity of layout and navigation of theirs Web sites. They are usually generated using authoring/publishing tools or by programs that build HTML pages to publish content from a database. In spite of their usefulness, the content of templates can negatively affect the quality of results produced by systems that automatically process information available in web sites, such as search engines, clustering and automatic categorization programs. Further, the information available in templates is redundant and thus processing and storing such information just once for a set of pages may save computational resources. In this paper, we present and evaluate methods for detecting templates considering a scenario where multiple templates can be found in a collection of Web pages. Most of previous work have studied template detection algorithms in a scenario where the collection has just a single template. The scenario with multiple templates is more realistic and, as it is discussed here, it raises important questions that may require extensions and adjustments in previously proposed template detection algorithms. We show how to apply and evaluate two template detection algorithms in this scenario, creating solutions for detecting multiple templates. The methods studied partitions the input collection into clusters that contain common HTML paths and share a high number of HTML nodes and then apply a single-template detection procedure over each cluster. We also propose a new algorithm for single template detection based on a restricted form of bottom-up tree-mapping that requires only small set of pages to correctly identify a template and which has a worst-case linear complexity. Our experimental results over a representative set of Web pages show that our approach is efficient and scalable while obtaining accurate results.