Webspam demotion: Low complexity node aggregation methods

Authors:
Thomas Largillier;Sylvain Peyronnet
Affiliations:
Univ Paris-Sud, LRI, CNRS, INRIA, Orsay F-91405, France;Univ Paris-Sud, LRI, CNRS, INRIA, Orsay F-91405, France
Venue:
Neurocomputing
Year:
2012

Citing 13
Cited 1

A cluster algorithm for graphs
The webgraph framework I: compression techniques. Proceedings of the 13th international conference on World Wide Web
Topical TrustRank: using topicality to combat web spam. Proceedings of the 15th international conference on World Wide Web
Detecting spam web pages through content analysis. Proceedings of the 15th international conference on World Wide Web
Link spam detection based on mass estimation. VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Fighting Spam on Social Web Sites: A Survey of Approaches and Future Challenges. IEEE Internet Computing
Combating web spam with trustrank. VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Web spam identification through content and hyperlinks. AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Robust PageRank and locally computable spam detection features. AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
A study of link farm distribution and evolution using a time series of web snapshots. Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Web spam identification through language model analysis. Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Nonlinear static-rank computation. Proceedings of the 18th ACM conference on Information and knowledge management
Lightweight Clustering Methods for Webspam Demotion. WI-IAT '10 Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01

Editorial: Special issue on advances in web intelligence. Neurocomputing

Quantified Score

Hi-index	0.01

Abstract

Search engine result pages (SERPs) for a specific query are constructed according to several mechanisms. One of them consists of ranking Web pages by their importance, regardless of their semantics. Indeed, relevance to a query is not enough to provide high-quality results, and popularity is used to arbitrate between equally relevant Web pages. The best-known algorithm that ranks Web pages according to their popularity is PageRank. The term Webspam was coined to denote Web pages created with the sole purpose of fooling ranking algorithms such as PageRank. The goal of Webspam is to promote a target page by increasing its rank. It is important for Web search engines to spot and discard Webspam in order to provide their users with an unbiased list of results. Webspam techniques evolve constantly to remain effective, but most of the time they still consist of creating a specific linking architecture around the target page to increase its rank. In this paper we study the effects of node aggregation on Google's well-known ranking algorithm, PageRank, in the presence of Webspam. Our node aggregation methods construct clusters of nodes, each of which is treated as a single node in the PageRank computation. Since the Web graph is far too big for classic clustering techniques, we present four lightweight aggregation techniques suited to its size. Experimental results on the WEBSPAM-UK2007 dataset show the value of the approach, which is further confirmed by statistical evidence.
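
The idea of treating a cluster as a single node in the PageRank computation can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy graph, the page names, the hand-written `cluster_of` mapping, and the helper names `aggregate` and `pagerank` are hypothetical, and the mapping is not one of the four aggregation techniques the paper proposes (the abstract does not describe them). The sketch only shows what happens to a simple link farm once its pages are merged.

```python
def aggregate(edges, cluster_of):
    """Merge each cluster into one node; drop intra-cluster and duplicate links."""
    merged = set()
    for src, dst in edges:
        a, b = cluster_of.get(src, src), cluster_of.get(dst, dst)
        if a != b:
            merged.add((a, b))
    return merged


def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Plain power-iteration PageRank; dangling mass is spread uniformly."""
    nodes = sorted(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out[src]:
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank


# Toy graph (hypothetical): a, b, c are ordinary pages; t is the spam target,
# boosted by farm pages f1..f3 that link to it and receive links back.
edges = {
    ("a", "b"), ("b", "c"), ("c", "a"), ("a", "t"),
    ("f1", "t"), ("f2", "t"), ("f3", "t"),
    ("t", "f1"), ("t", "f2"), ("t", "f3"),
}
nodes = {v for edge in edges for v in edge}

# Assumed clustering: the farm and its target collapse into one node "farm".
cluster_of = {"t": "farm", "f1": "farm", "f2": "farm", "f3": "farm"}
merged_edges = aggregate(edges, cluster_of)
merged_nodes = {cluster_of.get(v, v) for v in nodes}

before = pagerank(nodes, edges)
after = pagerank(merged_nodes, merged_edges)

print("before:", sorted(before.items(), key=lambda kv: -kv[1]))
print("after: ", sorted(after.items(), key=lambda kv: -kv[1]))
```

In this toy graph the target page `t` tops the ranking before aggregation, because the farm recycles rank back into it. Once the farm is collapsed, its internal links disappear and the merged node ranks below the ordinary page `a`. Real Web graphs and the paper's clustering heuristics will behave differently; the sketch only conveys the mechanism.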