A tutorial on hidden Markov models and selected applications in speech recognition
Readings in speech recognition
C4.5: programs for machine learning
C4.5: programs for machine learning
Computer Networks and ISDN Systems
Machine Learning - Special issue on learning with probabilistic representations
Combining labeled and unlabeled data with co-training
COLT' 98 Proceedings of the eleventh annual conference on Computational learning theory
Semi-supervised support vector machines
Proceedings of the 1998 conference on Advances in neural information processing systems II
Web search behavior of Internet experts and newbies
Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking
Analyzing the effectiveness and applicability of co-training
Proceedings of the ninth international conference on Information and knowledge management
An introduction to hidden Markov models and Bayesian networks
Hidden Markov models
Discovery of Web Robot Sessions Based on their Navigational Patterns
Data Mining and Knowledge Discovery
Partially Supervised Classification of Text Documents
ICML '02 Proceedings of the Nineteenth International Conference on Machine Learning
Transductive Inference for Text Classification using Support Vector Machines
ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Enhancing Supervised Learning with Unlabeled Data
ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Unsupervised word sense disambiguation rivaling supervised methods
ACL '95 Proceedings of the 33rd annual meeting on Association for Computational Linguistics
Semi-Supervised Self-Training of Object Detection Models
WACV-MOTION '05 Proceedings of the Seventh IEEE Workshops on Application of Computer Vision (WACV/MOTION'05) - Volume 1 - Volume 01
LA-WEB '05 Proceedings of the Third Latin American Web Congress
Learning subjective nouns using extraction pattern bootstrapping
CONLL '03 Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4
Proceedings of the 4th ACM workshop on Recurring malcode
Semi-Supervised Learning (Adaptive Computation and Machine Learning)
Semi-Supervised Learning (Adaptive Computation and Machine Learning)
HotBots'07 Proceedings of the first conference on First Workshop on Hot Topics in Understanding Botnets
Learning to rank for information retrieval (LR4IR 2007)
ACM SIGIR Forum
Characterizing typical and atypical user sessions in clickstreams
Proceedings of the 17th international conference on World Wide Web
A large-scale study of automated web search traffic
AIRWeb '08 Proceedings of the 4th international workshop on Adversarial information retrieval on the web
Are click-through data adequate for learning web search rankings?
Proceedings of the 17th ACM conference on Information and knowledge management
Web robot detection: A probabilistic reasoning approach
Computer Networks: The International Journal of Computer and Telecommunications Networking
BotGraph: large scale spamming botnet detection
NSDI'09 Proceedings of the 6th USENIX symposium on Networked systems design and implementation
Learning from labeled and unlabeled data: an empirical study across techniques and domains
Journal of Artificial Intelligence Research
CAPTCHA: using hard AI problems for security
EUROCRYPT'03 Proceedings of the 22nd international conference on Theory and applications of cryptographic techniques
Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning
Query suggestion for E-commerce sites
Proceedings of the fourth ACM international conference on Web search and data mining
What's clicking what? techniques and innovations of today's clickbots
DIMVA'11 Proceedings of the 8th international conference on Detection of intrusions and malware, and vulnerability assessment
Spotting opinion spammers using behavioral footprints
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Search engine click spam detection based on bipartite graph propagation
Proceedings of the 7th ACM international conference on Web search and data mining
Hi-index | 0.00 |
In this paper, we propose a semi-supervised learning approach for classifying program (bot) generated web search traffic from that of genuine human users. The work is motivated by the challenge that the enormous amount of search data pose to traditional approaches that rely on fully annotated training samples. We propose a semi-supervised framework that addresses the problem in multiple fronts. First, we use the CAPTCHA technique and simple heuristics to extract from the data logs a large set of training samples with initial labels, though directly using these training data is problematic because the data thus sampled are biased. To tackle this problem, we further develop a semi-supervised learning algorithm to take advantage of the unlabeled data to improve the classification performance. These two proposed algorithms can be seamlessly combined and very cost efficient to scale the training process. In our experiment, the proposed approach showed significant (i.e. 2:1) improvement compared to the traditional supervised approach.