A fast and robust method for web page template detection and removal

Authors:
Karane Vieira;Altigran S. da Silva;Nick Pinto;Edleno S. de Moura;João M. B. Cavalcanti;Juliana Freire
Affiliations:
Universidade Federal do Amazonas, Manaus, AM, Brazil;Universidade Federal do Amazonas, Manaus, AM, Brazil;Universidade Federal do Amazonas, Manaus, AM, Brazil;Universidade Federal do Amazonas, Manaus, AM, Brazil;Universidade Federal do Amazonas, Manaus, AM, Brazil;University of Utah, Salt Lake City, UT
Venue:
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Year:
2006

Abstract

The widespread use of templates on the Web is considered harmful for two main reasons. Not only do they compromise the relevance judgment of many web IR and web mining methods such as clustering and classification, but they also negatively impact the performance and resource usage of tools that process web pages. In this paper, we present a new method that efficiently and accurately removes templates found in collections of web pages. Our method works in two steps. First, the costly process of template detection is performed over a small set of sample pages. Then, the derived template is removed from the remaining pages in the collection. This leads to substantial performance gains when compared to previous approaches that combine template detection and removal. We show, through an experimental evaluation, that our approach is effective for identifying terms occurring in templates, obtaining F-measure values around 0.9, and that it also boosts the accuracy of web page clustering and classification methods.
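
The two-step workflow described in the abstract (detect on a small sample, then remove from the rest) can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify; it swaps in a simple line-frequency heuristic as a stand-in for detection. The function names `detect_template`, `remove_template`, and `f_measure`, the `min_fraction` threshold, and the toy pages are all assumptions made for illustration.

```python
from collections import Counter


def detect_template(sample_pages, min_fraction=0.8):
    """Step 1 (stand-in heuristic): treat a line as template content if it
    occurs in at least min_fraction of the sample pages."""
    counts = Counter()
    for page in sample_pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = min_fraction * len(sample_pages)
    return {line for line, n in counts.items() if n >= threshold}


def remove_template(page, template_lines):
    """Step 2: drop empty lines and every line that belongs to the template."""
    kept = [
        line for line in page.splitlines()
        if line.strip() and line.strip() not in template_lines
    ]
    return "\n".join(kept)


def f_measure(predicted, actual):
    """Harmonic mean of precision and recall over two sets of items."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy collection: shared navigation and footer, unique body.
pages = [
    "<nav>Home | News</nav>\n<p>Story about rivers</p>\n<footer>Contact us</footer>",
    "<nav>Home | News</nav>\n<p>Story about forests</p>\n<footer>Contact us</footer>",
    "<nav>Home | News</nav>\n<p>Story about cities</p>\n<footer>Contact us</footer>",
    "<nav>Home | News</nav>\n<p>Story about oceans</p>\n<footer>Contact us</footer>",
]

# Detection runs only on a small sample; removal is applied to the remainder.
sample, rest = pages[:2], pages[2:]
template = detect_template(sample)
cleaned = [remove_template(page, template) for page in rest]
```

The point of the split is that the expensive detection step sees only the sample, while the per-page removal step is a cheap set lookup. The `f_measure` helper shows how detected template items could be scored against a hand-labeled set, in the spirit of the evaluation the abstract mentions.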