Cost-sensitive online active learning with application to malicious URL detection

Authors:
Peilin Zhao;Steven C.H. Hoi
Affiliations:
Nanyang Technological University, Singapore, Singapore;Nanyang Technological University, Singapore, Singapore
Venue:
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2013

Abstract

Malicious Uniform Resource Locator (URL) detection is an important problem in web search and mining, and it plays a critical role in internet security. Many existing studies formulate the problem as a regular supervised binary classification task, which typically aims to optimize prediction accuracy. However, in a real-world malicious URL detection task, the numbers of malicious and legitimate URLs are highly imbalanced, so simply optimizing prediction accuracy is inappropriate. Another key limitation of existing work is the assumption that a large amount of labeled training data is available, which is impractical because human labeling can be expensive. To address these issues, we present a novel framework of Cost-Sensitive Online Active Learning (CSOAL), which queries only a small fraction of the training data for labeling and directly optimizes two cost-sensitive measures to address the class-imbalance issue. In particular, we propose two CSOAL algorithms and analyze their theoretical performance in terms of cost-sensitive bounds. We conduct an extensive set of experiments to examine the empirical performance of the proposed algorithms on a large-scale, challenging malicious URL detection task. The encouraging results show that the proposed technique, querying an extremely small amount of labeled data (about 0.5% of 1 million instances), achieves better or highly comparable classification performance compared with state-of-the-art cost-insensitive and cost-sensitive online classification algorithms that use a huge amount of labeled data.
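To make the general idea concrete, the following is a minimal Python sketch of an online linear classifier that asks for labels selectively and weights errors on the rare class more heavily. It is not the CSOAL algorithms from the paper, whose details the abstract does not give. The margin-based query rule, the cost-weighted hinge update, and all names and parameters (`CostSensitiveActiveLearner`, `delta`, `eta`, `pos_cost`, `oracle`) are assumptions chosen for illustration.

```python
import random


class CostSensitiveActiveLearner:
    """Illustrative online linear classifier that queries labels selectively.

    Hypothetical sketch: the query rule and update are assumptions,
    not the algorithms proposed in the paper.
    """

    def __init__(self, dim, delta=1.0, eta=0.1, pos_cost=10.0, seed=0):
        self.w = [0.0] * dim
        self.delta = delta        # must be > 0; larger values query more labels
        self.eta = eta            # learning rate
        self.pos_cost = pos_cost  # assumed extra weight on the rare (+1, malicious) class
        self.rng = random.Random(seed)
        self.queries = 0          # number of labels requested so far

    def score(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        return 1 if self.score(x) >= 0 else -1

    def step(self, x, oracle):
        """Predict for x, then possibly ask oracle(x) for the label and update."""
        p = self.score(x)
        y_hat = 1 if p >= 0 else -1
        # Query with probability delta / (delta + |p|), so low-margin
        # (uncertain) instances are more likely to be labeled.
        if self.rng.random() < self.delta / (self.delta + abs(p)):
            self.queries += 1
            y = oracle(x)  # label in {+1, -1}
            cost = self.pos_cost if y == 1 else 1.0
            # Subgradient step on the cost-weighted hinge loss
            # cost * max(0, 1 - y * p).
            if 1.0 - y * p > 0:
                self.w = [wi + self.eta * cost * y * xi
                          for wi, xi in zip(self.w, x)]
        return y_hat


def run_stream(learner, stream):
    """Feed (x, y) pairs to the learner; return the number of wrong predictions."""
    mistakes = 0
    for x, y in stream:
        if learner.step(x, lambda _x, _y=y: _y) != y:
            mistakes += 1
    return mistakes
```

In this sketch, the ratio `learner.queries / len(stream)` plays the role of the labeled fraction mentioned in the abstract, and `pos_cost` is the knob that trades missed malicious URLs against false alarms.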