SAAD, a content based Web Spam Analyzer and Detector

  • Authors:
  • Víctor M. Prieto;Manuel Álvarez;Fidel Cacheda

  • Venue:
  • Journal of Systems and Software
  • Year:
  • 2013

Abstract

Web Spam is one of the main difficulties that crawlers have to overcome, and therefore one of the main problems of the WWW. Several studies address the characterisation and detection of Web Spam pages, but none of them deals with all the possible kinds of Web Spam. This paper analyses different kinds of Web Spam pages and identifies new elements that characterise them, in order to define heuristics that can partially detect them. We also discuss several heuristics from the point of view of their effectiveness and computational efficiency. Taking both into account, we study several sets of heuristics and demonstrate how they improve on current results. Finally, we propose a new Web Spam detection system called SAAD (Spam Analyzer And Detector), which is based on the set of proposed heuristics and their use in a C4.5 classifier improved by means of Bagging and Boosting techniques. We have also tested our system on several well-known Web Spam datasets and found it to be very effective.
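
The classification step named in the abstract (content heuristics fed to a tree learner that is strengthened by Bagging) can be illustrated with a minimal sketch. It is not the SAAD implementation: the two features, their values, and all function names are hypothetical; one-level decision stumps chosen by gain ratio (the split criterion used by C4.5) stand in for full C4.5 trees; and only Bagging is shown, not Boosting.

```python
import math
import random
from collections import Counter

# Hypothetical heuristic values per page: [hidden_text_ratio, keyword_repetition_ratio].
PAGES = [[0.70, 0.60], [0.85, 0.75], [0.65, 0.90], [0.95, 0.55],
         [0.02, 0.10], [0.05, 0.08], [0.10, 0.12], [0.01, 0.15]]
LABELS = ["spam"] * 4 + ["ham"] * 4


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def split(rows, labels, feature, threshold):
    """Partition labels by the test rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    return left, right


def gain_ratio(labels, left, right):
    """Information gain of a binary split divided by its split information."""
    if not left or not right:
        return 0.0
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return (entropy(labels) - remainder) / split_info


def train_stump(rows, labels):
    """Return (feature, threshold, left_label, right_label) with the best gain ratio."""
    majority = Counter(labels).most_common(1)[0][0]
    best_score = 0.0
    best = (0, float("inf"), majority, majority)  # fallback: always predict majority
    for feature in range(len(rows[0])):
        for threshold in sorted({x[feature] for x in rows}):
            left, right = split(rows, labels, feature, threshold)
            score = gain_ratio(labels, left, right)
            if score > best_score:
                best_score = score
                best = (feature, threshold,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best


def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def train_bagging(rows, labels, n_models=15, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(rows)) for _ in rows]
        models.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models


def predict(models, x):
    """Majority vote over the bagged stumps."""
    return Counter(predict_stump(m, x) for m in models).most_common(1)[0][0]


models = train_bagging(PAGES, LABELS)
print(predict(models, [0.90, 0.80]), predict(models, [0.00, 0.05]))
```

Each stump sees a different resample of the labelled pages, and the final label is decided by vote, which is the variance-reduction idea behind Bagging; a real system would use full trees and heuristics measured on actual pages.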