Large-scale bot detection for search engines

Authors:
Hongwen Kang;Kuansan Wang;David Soukal;Fritz Behr;Zijian Zheng
Affiliations:
Carnegie Mellon University, Pittsburgh, PA, USA;Microsoft Research, Redmond, WA, USA;Microsoft, Redmond, WA, USA;Microsoft, Redmond, WA, USA;Microsoft, Redmond, WA, USA
Venue:
Proceedings of the 19th international conference on World wide web
Year:
2010

Abstract

In this paper, we propose a semi-supervised learning approach for distinguishing program-generated (bot) web search traffic from that of genuine human users. The work is motivated by the challenge that the enormous amount of search data poses to traditional approaches, which rely on fully annotated training samples. We propose a semi-supervised framework that addresses the problem on multiple fronts. First, we use the CAPTCHA technique and simple heuristics to extract from the data logs a large set of training samples with initial labels. Using these training data directly is problematic, however, because data sampled this way are biased. To tackle this problem, we further develop a semi-supervised learning algorithm that takes advantage of the unlabeled data to improve classification performance. The two proposed algorithms can be seamlessly combined and are very cost-efficient for scaling the training process. In our experiment, the proposed approach showed a significant (i.e., 2:1) improvement over the traditional supervised approach.
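
To make the general idea concrete, the following is a minimal sketch of generic self-training, one common form of semi-supervised learning. It is not the algorithm from the paper, which the abstract does not specify. The nearest-centroid classifier, the two features (queries per minute, click-through rate), the `margin` threshold, and all function names are assumptions chosen purely for illustration. A small seed set with initial labels is grown by pseudo-labeling only those unlabeled sessions the current model is confident about.

```python
from statistics import mean


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*points)]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def fit(labeled):
    """Return (bot_centroid, human_centroid) for (features, label) pairs.

    Assumes at least one sample of each class is present.
    """
    bot_c = centroid([x for x, y in labeled if y == "bot"])
    human_c = centroid([x for x, y in labeled if y == "human"])
    return bot_c, human_c


def classify(x, bot_c, human_c):
    """Assign the label of the nearer centroid."""
    return "bot" if distance(x, bot_c) < distance(x, human_c) else "human"


def self_train(labeled, unlabeled, margin=1.0, max_rounds=10):
    """Generic self-training loop (illustrative, not the paper's method).

    margin is a hypothetical confidence threshold: a sample is
    pseudo-labeled only if it is at least this much closer to one
    centroid than to the other.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        bot_c, human_c = fit(labeled)
        confident, rest = [], []
        for x in pool:
            gap = distance(x, human_c) - distance(x, bot_c)
            if gap >= margin:
                confident.append((x, "bot"))
            elif gap <= -margin:
                confident.append((x, "human"))
            else:
                rest.append(x)
        if not confident:
            break  # nothing new to learn from; stop early
        labeled.extend(confident)
        pool = rest
    return fit(labeled)


# Hypothetical features: [queries per minute, click-through rate].
seed = [([100.0, 0.0], "bot"), ([2.0, 0.6], "human")]
unlabeled = [[90.0, 0.1], [3.0, 0.5], [80.0, 0.0], [5.0, 0.4]]

bot_c, human_c = self_train(seed, unlabeled)
print(classify([70.0, 0.05], bot_c, human_c))
print(classify([4.0, 0.5], bot_c, human_c))
```

The confidence margin is the key design choice in this sketch: admitting only confident pseudo-labels limits how far an error in the seed labels can propagate, at the cost of leaving ambiguous sessions unused.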