Detecting splogs using similarities of splog HTML structures

Authors:
Taichi Katayama;Takayuki Yoshinaka;Takehito Utsuro;Yasuhide Kawada;Tomohiro Fukuhara
Affiliations:
University of Tsukuba, Tsukuba, Japan;Tokyo Denki University, Tokyo, Japan;University of Tsukuba, Tsukuba, Japan;Navix Co., Ltd., Tokyo, Japan;University of Tokyo, Kashiwa, Japan
Venue:
Proceedings of the 4th International Conference on Ubiquitous Information Management and Communication
Year:
2010

Citing 10
Cited 1

A sequential algorithm for training text classifiers

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Less is More: Active Learning with Support Vector Machines

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Support Vector Machine Active Learning with Applications to Text Classification

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Discovering informative content blocks from Web documents

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Automatically collecting, monitoring, and mining Japanese weblogs

Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters
Automatic Identification of Informative Sections of Web Pages

IEEE Transactions on Knowledge and Data Engineering
Spam double-funnel: connecting web spammers with advertisers

Proceedings of the 16th international conference on World Wide Web
Splog detection using self-similarity analysis on blog temporal dynamics

AIRWeb '07 Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
Analysing features of Japanese splogs and characteristics of keywords

AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
An empirical study on selective sampling in active learning for splog detection

Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web

Comparing similarity of HTML structures and affiliate IDs in splog analysis

DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications

Abstract

Spam blogs, or splogs, are blogs that host spam posts built from machine-generated or hijacked content, created solely to host advertisements or to increase the number of inlinks to target sites. Among such splogs, this paper focuses on detecting groups of splogs that are estimated to have been created by the same spammer. In particular, we show that similarities among the HTML structures of splogs created by the same spammer help improve splog detection performance. To measure the similarity of HTML structures, we extract a list of blocks (minimum units of content) from the DOM tree of each HTML file. We show that the HTML files of splogs estimated to have been created by the same spammer tend to have similar DOM trees, and that this tendency is quite effective for splog detection.
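
The block-list comparison described in the abstract can be illustrated with a minimal sketch. The abstract gives neither the exact block definition nor the similarity measure, so both are assumptions here: a "block" is approximated by the root-to-node tag path of each text-bearing node, and two block lists are compared with the sequence-matching ratio from Python's standard `difflib`. The names `BlockExtractor`, `block_list`, and `structure_similarity` are hypothetical and are not taken from the paper.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class BlockExtractor(HTMLParser):
    """Records the tag path of every node that directly contains text.

    Assumption: each such path stands in for one "block"; the paper's own
    block definition may differ.
    """

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.blocks = []  # one tag path per text-bearing node, in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip() and self.stack:
            self.blocks.append("/".join(self.stack))


def block_list(html):
    """Returns the list of block paths for one HTML document."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def structure_similarity(html_a, html_b):
    """Similarity in [0, 1] between the block lists of two documents."""
    matcher = SequenceMatcher(None, block_list(html_a), block_list(html_b),
                              autojunk=False)
    return matcher.ratio()
```

Under these assumptions, two pages generated from the same template score close to 1 even when their text differs, because only the tag paths are compared. A real detector would presumably need a threshold or a classifier on top of this score, which the abstract does not specify.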