Template detection for large scale search engines

Authors:
Liang Chen;Shaozhi Ye;Xing Li
Affiliations:
Tsinghua University, Beijing, P.R.China;University of California, Davis, CA;Tsinghua University, Beijing, P.R.China
Venue:
Proceedings of the 2006 ACM symposium on Applied computing
Year:
2006

Citing 13
Cited 12

Learning to remove Internet advertisements

Proceedings of the third annual conference on Autonomous Agents
Space/time trade-offs in hash coding with allowable errors

Communications of the ACM
Template detection via data mining and its applications

Proceedings of the 11th international conference on World Wide Web
Using micro information units for internet search

Proceedings of the eleventh international conference on Information and knowledge management
Discovering informative content blocks from Web documents

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Extracting structured data from Web pages

Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Eliminating noisy information in Web pages for data mining

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Extracting unstructured data from template generated web documents

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Automatic detection of fragments in dynamically generated web pages

Proceedings of the 13th international conference on World Wide Web
Automatic web news extraction using tree edit distance

Proceedings of the 13th international conference on World Wide Web
Block-level link analysis

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Block-based web search

Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Extracting content structure for web pages based on visual representation

APWeb'03 Proceedings of the 5th Asia-Pacific web conference on Web technologies and applications

Implementation and evaluation of a quality-based search engine

Proceedings of the seventeenth conference on Hypertext and hypermedia
Different indexing strategies for multilingual web retrieval: experiments with the EuroGOV corpus

Proceedings of the seventeenth conference on Hypertext and hypermedia
Removing manually generated boilerplate from electronic texts: experiments with project Gutenberg e-books

CASCON '07 Proceedings of the 2007 conference of the center for advanced studies on Collaborative research
Tracking Web spam with HTML style similarities

ACM Transactions on the Web (TWEB)
Site-Independent Template-Block Detection

PKDD 2007 Proceedings of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases
Extracting article text from the web with maximum subsequence segmentation

Proceedings of the 18th international conference on World wide web
On Finding Templates on Web Collections

World Wide Web
CETR: content extraction via tag ratios

Proceedings of the 19th international conference on World wide web
DOM based content extraction via text density

Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
News information extraction based on adaptive weighting using unsupervised Bayesian algorithm

WISM'11 Proceedings of the 2011 international conference on Web information systems and mining - Volume Part II
Multilingual web retrieval experiments with field specific indexing strategies for WebCLEF 2006 at the University of Hildesheim

CLEF'06 Proceedings of the 7th international conference on Cross-Language Evaluation Forum: evaluation of multilingual and multi-modal information retrieval
URL tree: efficient unsupervised content extraction from streams of web documents

Proceedings of the 22nd ACM international conference on Information & knowledge management

Abstract

Templates in web sites hurt search engine retrieval performance, especially in content relevance and link analysis. Current template removal methods suffer from slow processing speed and poor scalability when dealing with large volumes of web pages. In this paper, we propose a novel two-stage template detection method, which combines template detection and removal with the index building process of a search engine. First, web pages are segmented into blocks, and the blocks are clustered according to their style features. Second, similar contents sharing a common layout style are detected during the index building process. Blocks with similar layout style and similar content are identified as templates and deleted. Our experiment on eight popular web sites shows that our method runs 20-40% faster than the shingle and SST methods with comparable accuracy.
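
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the block representation (dictionaries with `tag_path`, `css_class`, and `text`), the function names, and the `min_pages` threshold are all assumptions made for illustration. In particular, the abstract speaks of "similar" content, while this sketch simplifies that to exact matching of a whitespace- and case-normalized hash.

```python
import hashlib
from collections import defaultdict


def style_key(block):
    # Hypothetical style features: the block's tag path and CSS class.
    # The abstract does not say which style features are actually used.
    return (block["tag_path"], block.get("css_class", ""))


def content_fingerprint(text):
    # Simplification: exact match after normalizing case and whitespace,
    # standing in for whatever similarity measure the paper applies.
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def detect_templates(pages, min_pages=2):
    """Return the set of (page_id, block_index) pairs judged to be templates.

    pages maps a page id to its list of already-segmented blocks.
    """
    # Stage 1: cluster blocks by layout style.
    clusters = defaultdict(list)
    for page_id, blocks in pages.items():
        for index, block in enumerate(blocks):
            clusters[style_key(block)].append((page_id, index, block["text"]))

    # Stage 2: inside each style cluster, look for repeated content.
    templates = set()
    for members in clusters.values():
        by_content = defaultdict(list)
        for page_id, index, text in members:
            by_content[content_fingerprint(text)].append((page_id, index))
        for hits in by_content.values():
            # A block is a template if the same content appears, with the
            # same style, on at least min_pages distinct pages.
            if len({page_id for page_id, _ in hits}) >= min_pages:
                templates.update(hits)
    return templates


def strip_templates(pages, templates):
    # Keep only the blocks that were not flagged as templates.
    return {
        page_id: [b for i, b in enumerate(blocks) if (page_id, i) not in templates]
        for page_id, blocks in pages.items()
    }
```

In a real indexer the second stage would presumably run as blocks are tokenized for the index rather than as a separate pass, which is where the abstract locates the speed gain; the sketch keeps the stages separate only for readability.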