Analysis and detection of web spam by means of web content

  • Authors:
  • Víctor M. Prieto; Manuel Álvarez; Rafael López-García; Fidel Cacheda

  • Affiliations:
  • Department of Information and Communication Technologies, University of A Coruña, A Coruña, Spain (all four authors)

  • Venue:
  • IRFC'12 Proceedings of the 5th conference on Multidisciplinary Information Retrieval
  • Year:
  • 2012

Abstract

Web Spam is one of the main difficulties that crawlers have to overcome. Gyöngyi and Garcia-Molina define it as "any deliberate human action that is meant to trigger an unjustifiably favourable relevance or importance of some web pages considering the pages' true value". Several studies have characterised and detected Web Spam pages, but none of them deals with all the possible kinds of Web Spam. This paper analyses different kinds of Web Spam pages and identifies new elements that characterise them. Taking these elements into account, we propose a new Web Spam detection system called SAAD, which is based on a set of heuristics that are used as input to a C4.5 classifier. Its results are further improved by means of Bagging and Boosting techniques. We have also tested our system on several well-known Web Spam datasets and found it to be very effective.
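
To illustrate the general idea of feeding content heuristics to a decision-tree learner and combining several learners by Bagging, the following is a minimal sketch and not the SAAD implementation. It makes several assumptions: the two features (`hidden_text_ratio`, `keyword_repeats`) and all sample values are hypothetical and do not come from the paper; a one-level stump stands in for a full C4.5 tree, keeping only the gain ratio criterion that C4.5 uses to choose splits; Boosting is not shown.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` (C4.5's split criterion)."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(labels) - remainder
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels):
    """Fit a one-level tree: the (feature, threshold) pair with the best gain ratio."""
    best_score, best_f, best_t = -1.0, None, None
    for f in range(len(rows[0])):
        values = [r[f] for r in rows]
        # Candidate thresholds: every observed value except the largest.
        for t in sorted(set(values))[:-1]:
            score = gain_ratio(values, labels, t)
            if score > best_score:
                best_score, best_f, best_t = score, f, t
    if best_f is None:
        # No split is possible (all rows identical): predict the majority class.
        label = majority(labels)
        return lambda row: label
    left = majority([y for r, y in zip(rows, labels) if r[best_f] <= best_t])
    right = majority([y for r, y in zip(rows, labels) if r[best_f] > best_t])
    return lambda row: left if row[best_f] <= best_t else right

def fit_bagging(rows, labels, n_models=11, seed=0):
    """Bagging: fit one stump per bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(rows)) for _ in rows]
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models

def predict(models, row):
    """Majority vote over the bagged stumps."""
    return majority([m(row) for m in models])

# Hypothetical heuristic vectors: (hidden_text_ratio, keyword_repeats).
ROWS = [(0.01, 2), (0.02, 1), (0.05, 3), (0.03, 2),
        (0.40, 25), (0.55, 30), (0.35, 18), (0.60, 40)]
LABELS = ["ham"] * 4 + ["spam"] * 4
MODELS = fit_bagging(ROWS, LABELS)
```

Calling `predict(MODELS, (0.9, 60))` classifies a page by majority vote over the stumps. A real system would use many more heuristics, full trees with pruning, and labelled pages from a public Web Spam dataset.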