Identifying Spam Web Pages Based on Content Similarity

  • Authors:
  • Maria Soledad Pera;Yiu-Kai Ng

  • Affiliations:
  • Computer Science Department, Brigham Young University, Provo, U.S.A.;Computer Science Department, Brigham Young University, Provo, U.S.A.

  • Venue:
  • ICCSA '08 Proceedings of the international conference on Computational Science and Its Applications, Part II
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines are faced with an annoying problem: the presence of misleading Web pages, i.e., spamWeb pages, that are ranked among legitimate Web pages. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless information. In order to improve the quality of Web searches, the number of spam pages on the Web must be reduced, if they cannot be eradicated entirely. In this paper, we present a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or low percentage of hidden content. By considering the content of Web pages, we develop a spam-detection tool that is (i) reliable, since we can accurately detect 94% of spam/legitimate Web pages, and (ii) computational inexpensive, since the word-correlation factors used for content analysis are precomputed. We have verified that our spam-detection approach outperforms existing anti-spam methods by an average of 10% in terms of F-measure.