Templates in web sites hurt search engine retrieval performance, especially in content relevance and link analysis. Current template removal methods suffer from slow processing and poor scalability when dealing with large volumes of web pages. In this paper, we propose a novel two-stage template detection method that combines template detection and removal with the index building process of a search engine. First, web pages are segmented into blocks, and the blocks are clustered according to their style features. Second, similar contents sharing a common layout style are detected during the index building process. Blocks with similar layout style and content are identified as templates and deleted. Our experiments on eight popular web sites show that our method runs 20-40% faster than the shingle and SST methods with comparable accuracy.
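
The two stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for illustration: the `Block` record, the representation of style features as a tuple, the use of exact style equality in place of clustering, the use of a hash of normalized text in place of a similarity measure, and the `min_pages` threshold. The sketch also runs as a batch pass over blocks rather than inside a real indexer.

```python
import hashlib
from collections import defaultdict
from typing import NamedTuple


class Block(NamedTuple):
    """Hypothetical page block; the field names are assumptions."""
    page_id: str
    style: tuple  # assumed style features, e.g. (tag, css_class, depth)
    text: str


def content_key(text: str) -> str:
    """Hash of case-folded, whitespace-normalized text.

    Assumption: exact matching after normalization stands in for the
    'similar contents' test mentioned in the abstract.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def detect_templates(blocks, min_pages=2):
    """Return the set of (style, content_key) pairs judged to be templates."""
    # Stage 1: group blocks by style signature (exact equality here,
    # as a simple stand-in for style-based clustering).
    clusters = defaultdict(list)
    for block in blocks:
        clusters[block.style].append(block)

    # Stage 2: within each style group, find content that recurs on at
    # least `min_pages` distinct pages.
    templates = set()
    for style, members in clusters.items():
        pages_by_content = defaultdict(set)
        for block in members:
            pages_by_content[content_key(block.text)].add(block.page_id)
        for key, pages in pages_by_content.items():
            if len(pages) >= min_pages:
                templates.add((style, key))
    return templates


def strip_templates(blocks, templates):
    """Keep only the blocks that were not identified as templates."""
    return [
        b for b in blocks
        if (b.style, content_key(b.text)) not in templates
    ]


# Usage with made-up data: a navigation bar shared by two pages.
sample_blocks = [
    Block("p1", ("div", "nav", 1), "Home | News | Contact"),
    Block("p1", ("div", "article", 2), "Story about the first topic."),
    Block("p2", ("div", "nav", 1), "Home  |  News | Contact"),
    Block("p2", ("div", "article", 2), "Story about the second topic."),
]
sample_templates = detect_templates(sample_blocks)
sample_kept = strip_templates(sample_blocks, sample_templates)
```

Grouping by style first keeps the content comparison confined to blocks that already look alike, so the second stage never compares a navigation bar against article text. A production system would likely replace the exact hash with a near-duplicate measure and fold the second stage into the indexer's term processing.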