Detecting splogs via temporal dynamics using self-similarity analysis

Authors:
Yu-Ru Lin;Hari Sundaram;Yun Chi;Junichi Tatemura;Belle L. Tseng
Affiliations:
Arizona State University, AZ;Arizona State University, AZ;NEC Laboratories America, Cupertino, CA;NEC Laboratories America, Cupertino, CA;NEC Laboratories America, Cupertino, CA
Venue:
ACM Transactions on the Web (TWEB)
Year:
2008

Abstract

This article addresses the problem of spam blog (splog) detection using the temporal and structural regularity of content, post time, and links. Splogs are undesirable blogs meant to attract search engine traffic and are used solely for promoting affiliate sites. Blogs are a popular online medium, and splogs not only degrade the quality of search engine results but also waste network resources. Splog detection is difficult because stable content descriptors are lacking. We have developed a new technique for detecting splogs, based on the observation that a blog is a dynamic, growing sequence of entries (or posts) rather than a collection of individual pages. In our approach, splogs are recognized by their temporal characteristics and content. There are three key ideas in our splog detection framework. (a) We represent the temporal dynamics of a blog using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts, in order to investigate temporal changes in the post sequence. (b) We study the temporal characteristics of blogs using a visual representation derived from the self-similarity measures. The visual signature reveals correlations between attributes and posts that depend on the type of blog (normal blog or splog). (c) We propose two types of novel temporal features to capture the temporal characteristics of splogs. In our splog detector, these novel features are combined with content-based features. We extract a content-based feature vector from blog home pages as well as from different parts of the blog. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM-based splog detector using the proposed features on real-world datasets, with appreciable results (90% accuracy).
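
The self-similarity idea in (a) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each post has already been reduced to a list of content tokens, it covers only the content attribute (the abstract also names time and link attributes), and the function names and toy posts are hypothetical. Entry S[i][j] is the histogram intersection between the normalized token histograms of posts i and j, so a blog that keeps repeating one template yields a matrix of uniformly high values.

```python
from collections import Counter


def normalized_histogram(tokens):
    """Turn a list of tokens into a histogram whose bins sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty post has no bins
    return {token: n / total for token, n in counts.items()}


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima; 1.0 for identical normalized histograms."""
    return sum(min(weight, h2.get(token, 0.0)) for token, weight in h1.items())


def self_similarity_matrix(posts):
    """S[i][j] = similarity between post i and post j, in posting order."""
    hists = [normalized_histogram(post) for post in posts]
    return [[histogram_intersection(a, b) for b in hists] for a in hists]


# Toy data (made up for illustration, not from the paper's datasets).
repetitive = [
    ["buy", "cheap", "pills"],
    ["buy", "cheap", "pills"],
    ["buy", "cheap", "pills", "now"],
]
varied = [
    ["trip", "to", "kyoto"],
    ["new", "bread", "recipe"],
    ["kyoto", "photos"],
]

S_rep = self_similarity_matrix(repetitive)  # off-diagonal values stay high
S_var = self_similarity_matrix(varied)      # off-diagonal values are mostly low

for row in S_rep:
    print([round(value, 2) for value in row])
```

The same construction would apply to any per-post histogram, such as one built from outgoing link targets. How the paper bins post times and derives its temporal features from these matrices is not specified in the abstract and is left out here.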