Effectiveness of template detection on noise reduction and websites summarization

  • Authors:
  • Derar Alassi;Reda Alhajj

  • Affiliations:
  • Department of Computer Science, University of Calgary, Calgary, Alberta, Canada;Department of Computer Science, University of Calgary, Calgary, Alberta, Canada

  • Venue:
  • Information Sciences: an International Journal
  • Year:
  • 2013

Quantified Score

Hi-index 0.07

Abstract

The World Wide Web is the most rapidly growing and accessible source of information. Its popularity has been driven largely by the wide availability of the Internet in almost every modern home and, with the spread of handheld devices, even on the go. Yet pages on the Web carry additional template information (which we call noise) that adds no value to the actual content of the page. Even worse, it can harm the effectiveness of Web mining techniques; such templates can be eliminated by preprocessing. Templates form one popular type of noise on the Internet. In this paper, we introduce Noise Detector (ND) as an effective approach for detecting and removing templates from Web pages. ND segments Web pages into semantically coherent blocks. It then computes content and structure similarities between these blocks; a presentational noise measure is used as well. ND dynamically calculates a threshold for differentiating noisy blocks. Provided that the investigated website has a single visible template, ND can detect the template with high accuracy using only two pages. ND can also be extended to detect multiple templates per website, in which case the challenge is to minimize the number of pages to be checked. Further, ND leads to website summarization. The conducted experiments show that ND outperforms existing approaches in space complexity, time complexity (see Section 4.6 for more details on ND's processing time compared with that of other algorithms), minimum requirements to produce acceptable results, and accuracy of results.
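
The pipeline outlined in the abstract (segment pages into blocks, compare blocks across two pages by content and structure, then pick a threshold dynamically) can be illustrated with a minimal Python sketch. This is not ND itself: the abstract does not give the similarity formulas or the threshold rule, so everything here is an assumption made for illustration. Content similarity is taken as Jaccard overlap of words, structure similarity as a sequence-match ratio over tag paths, and the threshold as the midpoint of the widest gap between sorted scores. The presentational noise measure is left out, and the blocks are assumed to be already segmented. All function names and the sample pages are hypothetical.

```python
from difflib import SequenceMatcher

def content_similarity(a, b):
    """Jaccard overlap between the word sets of two text blocks (assumed measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def structure_similarity(path_a, path_b):
    """Sequence-match ratio between two tag paths (assumed measure)."""
    return SequenceMatcher(None, path_a, path_b).ratio()

def block_score(block, other_page, w_content=0.5):
    """Best combined similarity of `block` against any block of the other page."""
    return max(
        w_content * content_similarity(block["text"], o["text"])
        + (1 - w_content) * structure_similarity(block["path"], o["path"])
        for o in other_page
    )

def dynamic_threshold(scores):
    """Midpoint of the widest gap between consecutive sorted scores (assumed rule)."""
    s = sorted(scores)
    if len(s) < 2:
        return 1.0
    _, i = max((s[k + 1] - s[k], k) for k in range(len(s) - 1))
    return (s[i] + s[i + 1]) / 2

def detect_template(page_a, page_b):
    """Return the blocks of page_a whose score exceeds the dynamic threshold."""
    scores = [block_score(b, page_b) for b in page_a]
    t = dynamic_threshold(scores)
    return [b for b, sc in zip(page_a, scores) if sc > t]

# Two hypothetical pages from the same site, already segmented into blocks.
page_a = [
    {"path": ["html", "body", "div", "ul"], "text": "Home Products About Contact"},
    {"path": ["html", "body", "div", "p"], "text": "Our new laptop ships with a faster processor"},
    {"path": ["html", "body", "div", "span"], "text": "Copyright 2013 Example Corp"},
]
page_b = [
    {"path": ["html", "body", "div", "ul"], "text": "Home Products About Contact"},
    {"path": ["html", "body", "div", "p"], "text": "Store hours changed during the winter holidays"},
    {"path": ["html", "body", "div", "span"], "text": "Copyright 2013 Example Corp"},
]

template = detect_template(page_a, page_b)
for block in template:
    print(block["text"])
```

On these sample pages the navigation and copyright blocks are flagged as template, while the article paragraph is kept as content. Note that the gap-based threshold in this sketch flags nothing when all blocks score identically, which is one reason a real detector would need a more careful rule.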