Linked latent Dirichlet allocation in web spam filtering

Authors:
István Bíró;Dávid Siklósi;Jácint Szabó;András A. Benczúr
Affiliations:
Computer and Automation Research Institute of the Hungarian Academy of Sciences;Computer and Automation Research Institute of the Hungarian Academy of Sciences;Computer and Automation Research Institute of the Hungarian Academy of Sciences;Computer and Automation Research Institute of the Hungarian Academy of Sciences
Venue:
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Year:
2009

Citing 12
Cited 10

Unsupervised learning by probabilistic latent semantic analysis

Machine Learning
Challenges in web search engines

ACM SIGIR Forum
Latent dirichlet allocation

The Journal of Machine Learning Research
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Detecting phrase-level duplication on the world wide web

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Detecting spam web pages through content analysis

Proceedings of the 15th international conference on World Wide Web
On-line spam filter fusion

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)

Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Spam Filtering Using Statistical Data Compression Models

The Journal of Machine Learning Research
Unsupervised prediction of citation influences

Proceedings of the 24th international conference on Machine learning
Know your neighbors: web spam detection using the web topology

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Joint latent topic models for text and citations

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Web spam challenge proposal for filtering in archives

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Latent dirichlet allocation for tag recommendation

Proceedings of the third ACM conference on Recommender systems
Topic-based social network analysis for virtual communities of interests in the Dark Web

ACM SIGKDD Workshop on Intelligence and Security Informatics
Web spam classification: a few features worth more

Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Adversarial Web Search

Foundations and Trends in Information Retrieval
Opinion helpfulness prediction in the presence of "words of few mouths"

World Wide Web
ADR-SPLDA: Activity discovery and recognition by combining sequential patterns and latent Dirichlet allocation

Pervasive and Mobile Computing
Effectively Detecting Content Spam on the Web Using Topical Diversity Measures

WI-IAT '12 Proceedings of the The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Automatically generated spam detection based on sentence-level topic information

Proceedings of the 22nd international conference on World Wide Web companion
Leveraging social network analysis with topic models and the Semantic Web extended

Web Intelligence and Agent Systems - Web Intelligence and Communities

Quantified Score

Hi-index	0.00

Visualization

Abstract

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model on the content and topics of a corpus of documents. In this paper we apply an extension of LDA for web spam classification. Our linked LDA technique takes also linkage into account: topics are propagated along links in such a way that the linked document directly influences the words in the linking document. The inferred LDA model can be applied for classification as dimensionality reduction similarly to latent semantic indexing. We test linked LDA on the WEBSPAM-UK2007 corpus. By using BayesNet classifier, in terms of the AUC of classification, we achieve 3% improvement over plain LDA with BayesNet, and 8% over the public link features with C4.5. The addition of this method to a log-odds based combination of strong link and content baseline classifiers results in a 3% improvement in AUC. Our method even slightly improves over the best Web Spam Challenge 2008 result.