Comparing similarity of HTML structures and affiliate IDs in splog analysis

Authors:
Taichi Katayama;Akihito Morijiri;Soichi Ishii;Takehito Utsuro;Yasuhide Kawada;Tomohiro Fukuhara
Affiliations:
University of Tsukuba, Tsukuba, Japan;University of Tsukuba, Tsukuba, Japan;Tokyo Denki University, Tokyo, Japan;University of Tsukuba, Tsukuba, Japan;Navix Co., Ltd., Tokyo, Japan;National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
Venue:
DASFAA'11 Proceedings of the 16th international conference on Database systems for advanced applications
Year:
2011

Abstract

Spam blogs, or splogs, are blogs that host spam posts created from machine-generated or hijacked content for the sole purpose of hosting advertisements or raising the number of in-links to target sites. This paper focuses on detecting groups of splogs that are estimated to have been created by the same spammer. We compare two clues: the similarity of the HTML structures of splogs, and affiliate IDs automatically extracted from splogs. We first show that the similarity of HTML structures is quite effective in splog detection as well as in identifying spammers. We then show that matching affiliate IDs extracted from splogs identify spammers much more directly than similarity of HTML structures does, although it is not easy to achieve high coverage in extracting affiliate IDs. Finally, we show that the coverage of the intersection of the two clues is relatively low, so the two need to be applied as complementary strategies.
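
The two clues compared in the abstract can be illustrated with a minimal sketch. The abstract does not specify how HTML structure similarity is measured or how affiliate IDs are extracted, so everything below is an assumption made for illustration: the Jaccard measure over adjacent tag pairs, the query parameter names in `AFFILIATE_PARAMS`, the `threshold` value, and all function names are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs

# Hypothetical query parameter names that might carry an affiliate ID.
AFFILIATE_PARAMS = ("affiliate_id", "aff_id")


class PageScanner(HTMLParser):
    """Collects the sequence of opening tag names and all href values."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def scan(html):
    scanner = PageScanner()
    scanner.feed(html)
    return scanner


def structure_similarity(html_a, html_b):
    """Jaccard similarity of adjacent tag-name pairs (an assumed measure)."""
    tags_a, tags_b = scan(html_a).tags, scan(html_b).tags
    pairs_a = set(zip(tags_a, tags_a[1:]))
    pairs_b = set(zip(tags_b, tags_b[1:]))
    if not pairs_a and not pairs_b:
        return 0.0
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)


def extract_affiliate_ids(html):
    """Return affiliate IDs found in link query strings (assumed format)."""
    ids = set()
    for url in scan(html).links:
        query = parse_qs(urlparse(url).query)
        for name in AFFILIATE_PARAMS:
            ids.update(query.get(name, []))
    return ids


def likely_same_spammer(html_a, html_b, threshold=0.8):
    """Complementary use of both clues: either one is enough to flag a pair.

    The threshold is an arbitrary illustrative value.
    """
    shared_ids = extract_affiliate_ids(html_a) & extract_affiliate_ids(html_b)
    return bool(shared_ids) or structure_similarity(html_a, html_b) >= threshold
```

In this sketch a shared affiliate ID is treated as direct evidence, while structural similarity serves as a fallback for pages from which no ID can be extracted, mirroring the complementary strategy the abstract argues for.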