Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model of the content and topics of a corpus of documents. In this paper we apply an extension of LDA to web spam classification. Our linked LDA technique also takes linkage into account: topics are propagated along links in such a way that the linked document directly influences the words in the linking document. As with latent semantic indexing, the inferred LDA model can serve as dimensionality reduction for classification. We test linked LDA on the WEBSPAM-UK2007 corpus. Using a BayesNet classifier, we achieve a 3% improvement in classification AUC over plain LDA with BayesNet, and 8% over the public link features with C4.5. Adding this method to a log-odds based combination of strong link and content baseline classifiers yields a 3% improvement in AUC. Our method even slightly improves on the best Web Spam Challenge 2008 result.
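The abstract does not spell out how the log-odds based combination is computed. As a rough illustration only, the sketch below assumes one common reading: each classifier outputs a spam probability, the probabilities are mapped to log-odds, averaged (optionally with weights), and mapped back to a probability. The function names, the clamping constant `eps`, and the example scores are all hypothetical and are not taken from the paper.

```python
import math


def logit(p, eps=1e-6):
    """Map a probability to log-odds, clamping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def combine_log_odds(probabilities, weights=None):
    """Weighted average of classifier outputs in log-odds space.

    Assumption: an unweighted average when no weights are given.
    The paper's actual combination scheme may differ.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    z = sum(w * logit(p) for w, p in zip(weights, probabilities)) / total
    return sigmoid(z)


# Hypothetical spam probabilities for a single host, one per classifier.
scores = {
    "link_baseline": 0.70,
    "content_baseline": 0.60,
    "linked_lda": 0.90,
}
combined = combine_log_odds(list(scores.values()))
print(f"combined spam probability: {combined:.3f}")
```

Averaging in log-odds space rather than in probability space lets a confident classifier pull the combined score further than a plain mean would, which is one plausible reason to add a new classifier to existing baselines in this way.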