Malicious Uniform Resource Locator (URL) detection is an important problem in web search and mining and plays a critical role in internet security. In the literature, many studies formulate the problem as a regular supervised binary classification task, which typically aims to optimize prediction accuracy. In a real-world malicious URL detection task, however, the numbers of malicious and legitimate URLs are highly imbalanced, so simply optimizing prediction accuracy is inappropriate. Another key limitation of existing work is the assumption that a large amount of labeled training data is available, which is impractical because human labeling can be expensive. To address these issues, we present a novel framework of Cost-Sensitive Online Active Learning (CSOAL), which queries only a small fraction of the training data for labeling and directly optimizes two cost-sensitive measures to address the class-imbalance issue. In particular, we propose two CSOAL algorithms and analyze their theoretical performance in terms of cost-sensitive bounds. We conduct an extensive set of experiments to examine the empirical performance of the proposed algorithms on a large-scale, challenging malicious URL detection task. The encouraging results show that the proposed technique, by querying an extremely small amount of labeled data (about 0.5% of 1 million instances), achieves better or highly comparable classification performance compared with state-of-the-art cost-insensitive and cost-sensitive online classification algorithms that use a huge amount of labeled data.
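The abstract does not spell out the two algorithms or name the two cost-sensitive measures, so the following is only a minimal sketch of the general idea and not the authors' method. It assumes a linear model, a margin-based randomized query rule (query with probability `delta / (delta + |p|)`), a cost-weighted hinge-loss gradient step, and, purely for illustration, a weighted sum of sensitivity and specificity as the cost-sensitive measure. All names and parameters (`csoal_sketch`, `weighted_sum`, `delta`, `eta`, `cost_pos`, `cost_neg`) are hypothetical.

```python
import random


def weighted_sum(y_true, y_pred, weight_pos=0.5):
    """Weighted sum of sensitivity and specificity (assumed measure, labels in {-1, +1})."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return weight_pos * sensitivity + (1.0 - weight_pos) * specificity


def csoal_sketch(stream, dim, delta=1.0, eta=1.0, cost_pos=0.9, cost_neg=0.1, seed=0):
    """Hypothetical online learner: query few labels, weight errors by class cost.

    stream: iterable of (x, y) pairs; y is read only when the label is queried.
    Returns the weight vector and the number of labels queried.
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    queried = 0
    for x, y in stream:
        p = sum(wi * xi for wi, xi in zip(w, x))  # signed margin of the current model
        # Assumed query rule: low-confidence (small |p|) instances are queried more often.
        if rng.random() < delta / (delta + abs(p)):
            queried += 1
            cost = cost_pos if y == 1 else cost_neg
            # Cost-weighted hinge loss; take a gradient step only when it is positive.
            if cost * max(0.0, 1.0 - y * p) > 0.0:
                w = [wi + eta * cost * y * xi for wi, xi in zip(w, x)]
    return w, queried


# Why accuracy misleads under imbalance: 1 malicious URL among 100,
# and a classifier that always predicts "legitimate" (-1).
y_true = [1] + [-1] * 99
y_pred = [-1] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = weighted_sum(y_true, y_pred)
```

In this toy case the always-legitimate classifier is right on 99 of 100 instances yet detects no malicious URL, which the sensitivity term exposes. Setting `cost_pos` higher than `cost_neg` is one simple way to make mistakes on the rare malicious class count for more during updates.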