Linked latent Dirichlet allocation in web spam filtering

  • Authors:
  • István Bíró; Dávid Siklósi; Jácint Szabó; András A. Benczúr

  • Affiliations:
  • Computer and Automation Research Institute of the Hungarian Academy of Sciences (all four authors)

  • Venue:
  • Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
  • Year:
  • 2009

Abstract

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model of the content and topics of a corpus of documents. In this paper we apply an extension of LDA to web spam classification. Our linked LDA technique also takes linkage into account: topics are propagated along links in such a way that the linked document directly influences the words in the linking document. As with latent semantic indexing, the inferred LDA model can serve as a dimensionality reduction step for classification. We test linked LDA on the WEBSPAM-UK2007 corpus. Using a BayesNet classifier, we achieve a 3% improvement in classification AUC over plain LDA with BayesNet, and an 8% improvement over the public link features with C4.5. Adding this method to a log-odds based combination of strong link and content baseline classifiers yields a 3% improvement in AUC. Our method even slightly improves on the best Web Spam Challenge 2008 result.
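
The abstract mentions a log-odds based combination of classifiers but does not spell out the scheme. As a rough illustration only, the sketch below assumes the simplest variant: each classifier outputs a spam probability, the probabilities are mapped to log-odds, averaged (optionally with weights), and mapped back to a probability. The function names, the weights, and the example scores are hypothetical and are not taken from the paper.

```python
import math


def log_odds(p, eps=1e-6):
    """Map a probability to log-odds, clamping to avoid division by zero."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def combine_log_odds(probabilities, weights=None):
    """Combine spam probabilities by a weighted average of their log-odds.

    Assumption: a plain weighted average; the paper's actual combination
    may differ in weighting or in how the scores are calibrated.
    """
    if not probabilities:
        raise ValueError("need at least one classifier score")
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("weights and probabilities must have equal length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    avg = sum(w * log_odds(p) for w, p in zip(weights, probabilities)) / total
    # Inverse of the log-odds transform (logistic function).
    return 1.0 / (1.0 + math.exp(-avg))


# Hypothetical per-classifier spam probabilities for a single host.
scores = {"link": 0.70, "content": 0.60, "linked_lda": 0.90}
combined = combine_log_odds(list(scores.values()))
print(f"combined spam probability: {combined:.3f}")
```

Averaging in log-odds space rather than probability space lets a confident classifier move the combined score more than one that is close to 0.5, which is one common motivation for this kind of combination.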