Latent dirichlet allocation in web spam filtering

  • Authors:
  • István Bíró;Jácint Szabó;András A. Benczúr

  • Affiliations:
  • Computer and Automation Research Institute of the Hungarian Academy of Sciences;Computer and Automation Research Institute of the Hungarian Academy of Sciences;Computer and Automation Research Institute of the Hungarian Academy of Sciences

  • Venue:
  • AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
  • Year:
  • 2008

Quantified Score

Hi-index 0.00

Visualization

Abstract

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model on the content and topics of a corpus of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique for web spam classification. We create a bag-of-words document for every Web site and run LDA both on the corpus of sites labeled as spam and as non-spam. In this way collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA. We test this method on the UK2007-WEBSPAM corpus, and reach a relative improvement of 11% in F-measure by a logistic regression based combination with strong link and content baseline classifiers.