Identifying suspicious URLs: an application of large-scale online learning

Authors:
Justin Ma; Lawrence K. Saul; Stefan Savage; Geoffrey M. Voelker
Affiliation:
UC San Diego, La Jolla, CA (all authors)
Venue:
ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
Year:
2009

Abstract

This paper explores online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. We show that this application is particularly well suited to online algorithms, both because the training data is too large to process efficiently in batch and because the distribution of features that typify malicious URLs changes continuously. Using a real-time system we developed for gathering URL features, combined with a real-time source of labeled URLs from a large Web mail provider, we demonstrate that recently developed online algorithms can be as accurate as batch techniques, achieving classification accuracies of up to 99% over a balanced data set.
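
To make the online setting described in the abstract concrete, here is a minimal Python sketch, not the authors' system. It assumes a Passive-Aggressive (PA-I) update as one example of an online linear classifier; the abstract does not name the algorithms evaluated. The tokenization in `lexical_features`, the class and function names, and the example URLs are all hypothetical, and host-based features (which would require external lookups) are left out.

```python
import re


def lexical_features(url):
    """Binary bag-of-tokens features from the URL string (hypothetical scheme)."""
    host, _, path = url.split("://", 1)[-1].partition("/")
    feats = {"bias": 1.0}
    for tok in host.lower().split("."):
        if tok:
            feats["host=" + tok] = 1.0
    for tok in re.split(r"[^A-Za-z0-9]+", path.lower()):
        if tok:
            feats["path=" + tok] = 1.0
    return feats


class PassiveAggressive:
    """PA-I online linear classifier; labels are +1 (malicious) or -1 (benign)."""

    def __init__(self, C=1.0):
        self.C = C   # cap on the step size (aggressiveness parameter)
        self.w = {}  # sparse weights: only features seen so far are stored

    def score(self, feats):
        return sum(self.w.get(k, 0.0) * v for k, v in feats.items())

    def predict(self, feats):
        return 1 if self.score(feats) >= 0 else -1

    def update(self, feats, y):
        """Apply one PA-I step and return the hinge loss measured before it."""
        loss = max(0.0, 1.0 - y * self.score(feats))
        if loss > 0.0:
            norm_sq = sum(v * v for v in feats.values())  # >= 1 (bias feature)
            tau = min(self.C, loss / norm_sq)
            for k, v in feats.items():
                self.w[k] = self.w.get(k, 0.0) + tau * y * v
        return loss


def run_stream(model, stream):
    """Predict on each labeled URL, then learn from it; return the mistake count."""
    mistakes = 0
    for url, y in stream:
        feats = lexical_features(url)
        if model.predict(feats) != y:
            mistakes += 1
        model.update(feats, y)
    return mistakes


if __name__ == "__main__":
    # Made-up URLs on reserved test domains; labels are illustrative only.
    stream = [
        ("http://login.bank.example.test/secure/update.php", 1),
        ("http://www.university.test/courses/ml", -1),
    ]
    model = PassiveAggressive(C=1.0)
    print("cumulative mistakes:", run_stream(model, stream))
```

Each URL is processed once and then discarded, so memory grows with the number of distinct features rather than with the number of examples, and the weights keep adapting as new tokens appear. These are the two properties the abstract gives as reasons for preferring online over batch training.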