Web spam identification through language model analysis

  • Authors: Juan Martinez-Romo; Lourdes Araujo

  • Affiliations: UNED, Madrid, Spain; UNED, Madrid, Spain

  • Venue: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
  • Year: 2009

Abstract

This paper applies a language model approach to different sources of information extracted from a Web page, in order to provide high-quality indicators for the detection of Web spam. Two pages connected by a hyperlink should be topically related, even if the contextual relation is weak. For this reason, we analyse different sources of information in a Web page that belong to the context of a link, and we apply Kullback-Leibler divergence to them to characterise the relationship between the two linked pages. Moreover, we combine some of these sources of information in order to obtain richer language models. Given the different nature of internal and external links, our study also distinguishes between these two types of links, which yields a significant improvement in classification tasks. The result is a system that improves the detection of Web spam on two large public datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007.
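
The core measurement described in the abstract, a Kullback-Leibler divergence between the language model of a link's context and that of the linked page, can be illustrated with a minimal sketch. The abstract does not specify the tokenisation, the smoothing method, or the exact sources of information used, so the whitespace tokeniser, the additive smoothing constant `alpha`, and the function names below are assumptions chosen for illustration only, not the authors' implementation.

```python
import math
from collections import Counter


def unigram_counts(text):
    """Count lowercase whitespace-separated terms (assumed tokenisation)."""
    return Counter(text.lower().split())


def kl_divergence(source_text, target_text, alpha=0.01):
    """KL(P || Q) between unigram models of two texts.

    P is estimated from source_text (e.g. the text around a link) and
    Q from target_text (e.g. the linked page). Additive smoothing with
    `alpha` is an assumption; it keeps every probability non-zero so
    the logarithm is always defined.
    """
    p_counts = unigram_counts(source_text)
    q_counts = unigram_counts(target_text)
    vocab = set(p_counts) | set(q_counts)
    if not vocab:
        return 0.0

    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)

    divergence = 0.0
    for term in vocab:
        # Counter returns 0 for terms missing from one of the texts.
        p = (p_counts[term] + alpha) / p_total
        q = (q_counts[term] + alpha) / q_total
        divergence += p * math.log(p / q)
    return divergence


if __name__ == "__main__":
    # Hypothetical texts, not taken from the datasets named in the abstract.
    anchor = "cheap flights to madrid"
    related_page = "book cheap flights and hotels in madrid spain"
    unrelated_page = "buy pills online casino poker bonus"
    print(kl_divergence(anchor, related_page))
    print(kl_divergence(anchor, unrelated_page))
```

Under these assumptions, a link whose context shares vocabulary with its target page gives a lower divergence than one pointing to unrelated content, which is the kind of signal that could serve as a classifier feature. Computing the value separately for internal and external links, as the abstract suggests, would only require grouping the link pairs before calling `kl_divergence`.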