Lightweight Clustering Methods for Webspam Demotion

  • Authors:
  • Thomas Largillier; Sylvain Peyronnet

  • Venue:
  • WI-IAT '10 Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01
  • Year:
  • 2010

Abstract

To respond quickly to a specific query, the main search engines rely on several mechanisms. One of them consists in ranking web pages according to their importance, regardless of the semantics of the web page. Indeed, relevance to a query is not enough to provide a high-quality result, and popularity is used to arbitrate between equally relevant web pages. Webspam broadly denotes any web page created with the sole purpose of fooling ranking algorithms such as PageRank. The aim of Webspam is to promote a target page by increasing its rank. Spotting and discarding Webspam is therefore an important issue for Web search engines that want to provide their users with an unbiased list of results. Webspam techniques have to evolve constantly to remain effective, but most of the time they consist in creating a specific linking architecture around the target page to increase its rank. In this paper we propose to study the effects of graph clustering on Google's well-known ranking algorithm (PageRank) in the presence of Webspam. Since the web graph is far too big for classic clustering techniques, we present three lightweight techniques to cluster the web graph. Experimental results show the interest of the approach, which is moreover confirmed by statistical evidence.
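
As an illustration of the problem the abstract describes, the following minimal sketch shows how a linking architecture built around a target page can raise its PageRank. It is not the authors' method and does not implement any of the three clustering techniques, which the abstract does not detail. The toy graphs, the page names, the damping factor of 0.85 and the `pagerank` helper are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Illustrative helper (an assumption of this sketch, not taken from the paper).
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the uniform "teleportation" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # Split this page's damped rank evenly among its outlinks.
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Dangling page: spread its damped rank over all pages.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: "target" has no incoming links.
honest = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "target": ["a"]}

# Same graph plus a hypothetical link farm: five spam pages that link
# to the target, which links back to them.
spam_pages = [f"spam{i}" for i in range(5)]
farm = dict(honest)
farm["target"] = ["a"] + spam_pages
for s in spam_pages:
    farm[s] = ["target"]

before = pagerank(honest)
after = pagerank(farm)
print(f"target rank without farm: {before['target']:.4f}")
print(f"target rank with farm:    {after['target']:.4f}")
```

In this toy setting the target's score rises once the farm is attached, even though none of the original pages link to it. This is the kind of rank inflation that a demotion method would aim to counteract.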