Combining textual content and hyperlinks in web spam detection

  • Authors:
  • F. Javier Ortega;Craig Macdonald;José A. Troyano;Fermín L. Cruz;Fernando Enríquez

  • Affiliations:
Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, Sevilla, Spain (F. J. Ortega, J. A. Troyano, F. L. Cruz, F. Enríquez); Department of Computing Science, University of Glasgow, Glasgow, UK (C. Macdonald)

  • Venue:
NLDB'11: Proceedings of the 16th International Conference on Natural Language Processing and Information Systems
  • Year:
  • 2011

Abstract

In this work, we tackle the problem of spam detection on the Web. Spam web pages have become a problem for Web search engines because they degrade the quality of retrieval results. Our approach is based on a random-walk algorithm that produces a ranking of pages according to their relevance and their spam likelihood. The novelty of our approach lies in taking the textual content of web pages into account, both to characterize the web graph and to obtain an a priori estimate of each page's spam likelihood. Our graph-based algorithm computes two scores for each node in the graph; intuitively, these values represent how bad or good (spam-like or not) a web page is, according to its textual content and its relations in the graph. Our experiments show that the proposed technique outperforms other link-based techniques for spam detection.
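The abstract describes a random walk over the web graph that is biased by content-derived priors and yields two scores per page. As an illustration only, and not the authors' exact formulation, the following minimal sketch shows one common way to realize this idea: two personalized PageRank computations, one seeded with a hypothetical spam prior and one with a hypothetical non-spam prior, both assumed to come from a text classifier. The function and variable names (`personalized_pagerank`, `spam_prior`, `ham_prior`) are illustrative assumptions.

```python
import numpy as np

def personalized_pagerank(adjacency, prior, damping=0.85, iterations=50):
    """Power-iteration random walk biased toward pages with a high prior."""
    n = adjacency.shape[0]
    out_degree = adjacency.sum(axis=1)
    # Row-stochastic transition matrix; pages without out-links are handled
    # below by redistributing their mass according to the prior.
    transition = np.divide(adjacency, out_degree[:, None],
                           out=np.zeros_like(adjacency, dtype=float),
                           where=out_degree[:, None] > 0)
    prior = prior / prior.sum()
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        dangling = scores[out_degree == 0].sum()
        scores = damping * (scores @ transition + dangling * prior) \
                 + (1 - damping) * prior
    return scores

# Toy 4-page web graph; entry (i, j) = 1 means page i links to page j.
adjacency = np.array([[0, 1, 1, 0],
                      [0, 0, 1, 0],
                      [1, 0, 0, 1],
                      [0, 0, 1, 0]], dtype=float)

# Hypothetical content-based priors (e.g. from a text classifier):
# a high spam_prior[i] means page i looks spam-like from its text alone.
spam_prior = np.array([0.05, 0.80, 0.10, 0.05])
ham_prior  = np.array([0.40, 0.05, 0.45, 0.10])

bad_score  = personalized_pagerank(adjacency, spam_prior)  # spam likelihood propagated over links
good_score = personalized_pagerank(adjacency, ham_prior)   # relevance/trust propagated over links

# Rank pages from least to most spam-like by combining the two scores.
ranking = np.argsort(good_score - bad_score)[::-1]
print(ranking)
```

In this sketch the link structure spreads the content-based evidence across the graph, so a page linked to mostly by spam-like pages accumulates a high bad score even if its own text looks benign; the paper's actual scoring and combination scheme may differ.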