Splog detection using self-similarity analysis on blog temporal dynamics

Authors:
Yu-Ru Lin;Hari Sundaram;Yun Chi;Junichi Tatemura;Belle L. Tseng
Affiliations:
Arizona State University;Arizona State University;NEC Laboratories America, Cupertino, CA;NEC Laboratories America, Cupertino, CA;NEC Laboratories America, Cupertino, CA
Venue:
AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Year:
2007

Abstract

This paper focuses on spam blog (splog) detection. Blogs are a highly popular new-media mechanism for social communication. The presence of splogs degrades blog search results and wastes network resources. Our approach exploits the unique temporal dynamics of blogs to detect splogs. There are three key ideas in our splog detection framework. First, we represent the blog temporal dynamics using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts. Second, we show via a novel visualization that the blog temporal characteristics reveal attribute correlation that depends on the type of blog (normal blog or splog). Third, we propose the use of temporal structural properties computed from self-similarity matrices across different attributes. In a splog detector, these novel features are combined with content-based features. We extract a content-based feature vector from different parts of the blog (URLs, post content, etc.). The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM-based splog detector using the proposed features on real-world datasets, with excellent results (90% accuracy).
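
As an illustration of the first idea, the following is a minimal sketch of a self-similarity matrix built over the content attribute of a blog's posts. It is not the authors' implementation. The abstract does not give the exact formula, so the sketch assumes that histogram intersection is normalized by the sum of element-wise maxima, and that each post is represented by a whitespace-tokenized word-count histogram. The function names and the sample posts are hypothetical.

```python
from collections import Counter


def histogram_intersection(h1, h2):
    """Normalized histogram intersection of two word-count histograms.

    Assumption: the overlap (sum of element-wise minima) is divided by
    the sum of element-wise maxima, so the score lies in [0, 1].
    """
    keys = set(h1) | set(h2)
    overlap = sum(min(h1[k], h2[k]) for k in keys)
    union = sum(max(h1[k], h2[k]) for k in keys)
    return overlap / union if union else 0.0


def self_similarity_matrix(posts):
    """Return S where S[i][j] compares post i with post j.

    Posts are assumed to be given in chronological order, so patterns
    in S reflect how content repeats over the life of the blog.
    """
    hists = [Counter(p.lower().split()) for p in posts]
    n = len(hists)
    return [[histogram_intersection(hists[i], hists[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical posts: two identical posts and one unrelated post.
posts = [
    "cheap pills buy now",
    "cheap pills buy now",
    "hiking trip photos from the lake",
]
S = self_similarity_matrix(posts)
```

In this sketch, repeated posts produce off-diagonal entries close to 1, while unrelated posts produce entries close to 0. The same construction could be applied to histograms of post times or outgoing links to obtain the time and link matrices mentioned in the abstract.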