An empirical study on selective sampling in active learning for splog detection

Authors:
Taichi Katayama;Takehito Utsuro;Yuuki Sato;Takayuki Yoshinaka;Yasuhide Kawada;Tomohiro Fukuhara
Affiliations:
University of Tsukuba, Tsukuba, Japan;University of Tsukuba, Tsukuba, Japan;University of Tsukuba, Tsukuba, Japan;Tokyo Denki University, Tokyo, Japan;Navix Co., Ltd., Tokyo, Japan;University of Tokyo, Kashiwa, Japan
Venue:
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
Year:
2009

Abstract

This paper studies how to reduce the amount of human supervision needed to distinguish splogs from authentic blogs when splog data sets are updated continuously, year by year. Following previous work on active learning, it empirically examines several strategies for selective sampling in active learning with Support Vector Machines (SVMs), applied to the task of splog / authentic blog detection. As the confidence measure for SVM learning, we employ the distance from the separating hyperplane to each test instance, a measure that has been well studied in active learning for text classification. Unlike the results reported for text classification tasks, in the splog / authentic blog detection task studied here, adding the least confident samples does not perform best.
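
The confidence measure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a linear SVM whose weight vector `w` and bias `b` have already been learned, and the function names `margin_distance` and `select_samples`, as well as the two strategy labels, are hypothetical names introduced here for illustration only.

```python
import math


def margin_distance(w, b, x):
    """Unsigned distance from instance x to the hyperplane w.x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(score) / norm


def select_samples(w, b, pool, k, strategy="least_confident"):
    """Return indices of k unlabeled pool instances to label next.

    'least_confident' picks the instances closest to the hyperplane;
    'most_confident' picks the instances farthest from it.
    Both strategy names are assumptions made for this sketch.
    """
    # Rank pool indices by increasing distance from the hyperplane.
    ranked = sorted(range(len(pool)),
                    key=lambda i: margin_distance(w, b, pool[i]))
    if strategy == "least_confident":
        return ranked[:k]
    if strategy == "most_confident":
        return ranked[-k:][::-1]
    raise ValueError("unknown strategy: " + strategy)


if __name__ == "__main__":
    # Toy weights and feature vectors, not data from the paper.
    w, b = [3.0, 4.0], 0.0
    pool = [[1.0, 0.0], [0.0, 0.1], [10.0, 10.0], [-2.0, -1.0]]
    print(select_samples(w, b, pool, 2, "least_confident"))
    print(select_samples(w, b, pool, 2, "most_confident"))
```

In a full active learning loop, the selected instances would be labeled by a human, added to the training set, and the SVM retrained before the next round of selection. The abstract's finding is that, for splog detection, the least confident strategy is not necessarily the best choice, so exposing the strategy as a parameter makes such comparisons straightforward.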