Latent Dirichlet allocation in web spam filtering

Authors:
István Bíró;Jácint Szabó;András A. Benczúr
Affiliations:
Computer and Automation Research Institute of the Hungarian Academy of Sciences;Computer and Automation Research Institute of the Hungarian Academy of Sciences;Computer and Automation Research Institute of the Hungarian Academy of Sciences
Venue:
AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Year:
2008

Citing 14
Cited 12

Document length normalization

Information Processing and Management: an International Journal - Special issue: history of information science
Probabilistic latent semantic indexing

Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Challenges in web search engines

ACM SIGIR Forum
The connectivity sonar: detecting site functionality by structural patterns

Proceedings of the fourteenth ACM conference on Hypertext and hypermedia
Latent Dirichlet allocation

The Journal of Machine Learning Research
Spam, damn spam, and statistics: using statistical analysis to locate spam web pages

Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004
Detecting phrase-level duplication on the world wide web

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Discovering Objects and their Localization in Images

ICCV '05 Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1 - Volume 01
Detecting spam web pages through content analysis

Proceedings of the 15th international conference on World Wide Web
On-line spam filter fusion

SIGIR '06 Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)

Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
Spam Filtering Using Statistical Data Compression Models

The Journal of Machine Learning Research
Employing Latent Dirichlet Allocation for fraud detection in telecommunications

Pattern Recognition Letters
Expectation-propagation for the generative aspect model

UAI'02 Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence

Looking into the past to better classify web spam

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Latent Dirichlet Allocation for Automatic Document Categorization

ECML PKDD '09 Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II
An efficient block model for clustering sparse graphs

Proceedings of the Eighth Workshop on Mining and Learning with Graphs
Temporal query log profiling to improve web search ranking

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Using biased discriminant analysis for email filtering

KES'10 Proceedings of the 14th international conference on Knowledge-based and intelligent information and engineering systems: Part I
Spam detection in online classified advertisements

Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
Adversarial Web Search

Foundations and Trends in Information Retrieval
PCA document reconstruction for email classification

Computational Statistics & Data Analysis
Comment spam detection by sequence mining

Proceedings of the fifth ACM international conference on Web search and data mining
Using semi-structured data for assessing research paper similarity

Information Sciences: an International Journal
Effectively Detecting Content Spam on the Web Using Topical Diversity Measures

WI-IAT '12 Proceedings of the 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01
Automatically generated spam detection based on sentence-level topic information

Proceedings of the 22nd international conference on World Wide Web companion

Abstract

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model of the content and topics of a corpus of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique, to web spam classification. We create a bag-of-words document for every Web site and run LDA separately on the corpus of sites labeled as spam and on the corpus of sites labeled as non-spam. In this way, collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA. We test this method on the UK2007-WEBSPAM corpus and reach a relative improvement of 11% in F-measure by a logistic regression based combination with strong link and content baseline classifiers.
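
The train-then-union procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the collapsed Gibbs sampler, the hyperparameters (alpha, beta, iteration counts), the toy documents, the 0.5 threshold and all function names (train_lda, spam_probability) are assumptions chosen for illustration. The sketch trains one LDA model per labeled corpus, concatenates the two topic collections, infers the topic mixture of an unseen site with the topics held fixed, and sums the probability mass that falls on spam topics.

```python
import random

def train_lda(docs, w2i, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for plain LDA; returns k topic-word distributions."""
    rng = random.Random(seed)
    V = len(w2i)
    docs = [[w2i[w] for w in doc] for doc in docs]
    ndk = [[0] * k for _ in docs]      # topic counts per document
    nkw = [[0] * V for _ in range(k)]  # word counts per topic
    nk = [0] * k                       # total tokens per topic
    z = [[rng.randrange(k) for _ in doc] for doc in docs]  # random initial topics
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            ndk[d][t] += 1
            nkw[t][w] += 1
            nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]            # remove the token, then resample its topic
                ndk[d][t] -= 1
                nkw[t][w] -= 1
                nk[t] -= 1
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                           for j in range(k)]
                t = rng.choices(range(k), weights)[0]
                z[d][i] = t
                ndk[d][t] += 1
                nkw[t][w] += 1
                nk[t] += 1
    return [[(nkw[t][w] + beta) / (nk[t] + V * beta) for w in range(V)]
            for t in range(k)]

def spam_probability(doc, spam_topics, ham_topics, w2i, iters=100, alpha=0.1, seed=0):
    """Infer the topic mixture over the union of topics (topics held fixed)
    and return the total probability assigned to spam topics."""
    rng = random.Random(seed)
    phi = spam_topics + ham_topics     # union of the two topic collections
    K = len(phi)
    words = [w2i[w] for w in doc if w in w2i]  # unknown words are ignored
    z = [rng.randrange(K) for _ in words]
    counts = [0] * K
    for t in z:
        counts[t] += 1
    for _ in range(iters):
        for i, w in enumerate(words):
            counts[z[i]] -= 1
            weights = [(counts[j] + alpha) * phi[j][w] for j in range(K)]
            z[i] = rng.choices(range(K), weights)[0]
            counts[z[i]] += 1
    total = len(words) + K * alpha
    theta = [(c + alpha) / total for c in counts]
    return sum(theta[:len(spam_topics)])

# Toy bag-of-words "sites" (invented for this sketch).
spam_docs = ["cheap pills buy pills online cheap".split(),
             "buy cheap pills discount pills online".split(),
             "casino bonus win casino jackpot bonus".split(),
             "win jackpot casino bonus online casino".split()]
ham_docs = ["university research paper library research".split(),
            "library paper university lecture research".split(),
            "garden recipe kitchen recipe garden bread".split(),
            "bread kitchen recipe garden kitchen".split()]
vocab = sorted({w for doc in spam_docs + ham_docs for w in doc})
w2i = {w: i for i, w in enumerate(vocab)}

spam_topics = train_lda(spam_docs, w2i, k=2, seed=1)
ham_topics = train_lda(ham_docs, w2i, k=2, seed=2)

THRESHOLD = 0.5  # assumed value; in practice it would be tuned on held-out data
p_spam_site = spam_probability("cheap casino bonus pills online".split(),
                               spam_topics, ham_topics, w2i)
p_ham_site = spam_probability("research library recipe garden".split(),
                              spam_topics, ham_topics, w2i)
print(p_spam_site > THRESHOLD, p_ham_site > THRESHOLD)
```

The spam topic probability computed here is a single feature; the abstract reports combining it with link and content baseline classifiers through logistic regression, which this sketch does not attempt.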