Identifying syntactic differences between two programs
Software—Practice & Experience
Function-based object model towards website adaptation
Proceedings of the 10th international conference on World Wide Web
Template detection via data mining and its applications
Proceedings of the 11th international conference on World Wide Web
Discovering informative content blocks from Web documents
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
The editing distance between trees: algorithms and applications
The editing distance between trees: algorithms and applications
Eliminating noisy information in Web pages for data mining
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Learning block importance models for web pages
Proceedings of the 13th international conference on World Wide Web
Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
Editorial: special issue on web content mining
ACM SIGKDD Explorations Newsletter
The volume and evolution of web page templates
WWW '05 Special interest tracks and posters of the 14th international conference on World Wide Web
Automatic extraction of informative blocks from webpages
Proceedings of the 2005 ACM symposium on Applied computing
A fast and robust method for web page template detection and removal
CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications)
On the quality of resources on the Web: An information retrieval perspective
Information Sciences: an International Journal
Computing block importance for searching on web sites
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
Access concentration detection in click logs to improve mobile Web-IR
Information Sciences: an International Journal
Information Sciences: an International Journal
Cleaning web pages for effective web content mining
DEXA'06 Proceedings of the 17th international conference on Database and Expert Systems Applications
Information Sciences: an International Journal
Joint image denoising using adaptive principal component analysis and self-similarity
Information Sciences: an International Journal
CoBAn: A context based model for data leakage prevention
Information Sciences: an International Journal
Hi-index | 0.07 |
The World Wide Web is the most rapidly growing and accessible source of information. Its popularity has been largely influenced by the wide availability of the Internet in almost every modern house and even on the go after the wide-spread of the handheld devices. Yet, pages on the Web have an additional template (we call it noisy) information that does not add value to the actual content of the page. Even worse, it can harm the effectiveness of Web mining techniques; these templates could be eliminated by preprocessing. Templates form one popular type of noise on the Internet. In this paper, we introduce Noise Detector (ND) as an effective approach for detecting and removing templates from Web pages. ND segments Web pages into semantically coherent blocks. Then it computes content and structure similarities between these blocks; a presentational noise measure is used as well. ND dynamically calculates a threshold for differentiating noisy blocks. Provided that the investigated website has a single visible template, ND can detect the template with high accuracy using two pages only. However, ND can be expanded to detect multiple templates per website, and the challenge will be to minimize the number of pages to be checked. Further, ND leads to website summarization. The conducted experiments show that ND outperforms existing approaches in space complexity, time complexity (see Section 4.6 for more details on ND's processing time against other algorithms'), minimum requirements to produce acceptable results, and results accuracy.