Evaluating a semisupervised approach to phishing URL identification in a realistic scenario

Authors:
Binod Gyawali;Thamar Solorio;Manuel Montes-y-Gómez;Bradley Wardman;Gary Warner
Affiliations:
University of Alabama at Birmingham, Birmingham, Alabama;University of Alabama at Birmingham, Birmingham, Alabama;University of Alabama at Birmingham, Birmingham, Alabama;University of Alabama at Birmingham, Birmingham, Alabama;University of Alabama at Birmingham, Birmingham, Alabama
Venue:
Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
Year:
2011

Abstract

Phishing sites have become a common means of stealing sensitive information, such as usernames, passwords, and credit card details, from internet users. We propose a semisupervised machine learning approach to detect phishing URLs within a set of phishing and spam URLs, all of which are collected from spam emails. In practice, far fewer phishing URLs than other URLs arrive through these emails. Our study therefore targets a realistic scenario: a highly imbalanced data set in which phishing and spam URLs occur at a ratio of 1:654. Training a learning algorithm requires labeled URLs, and manual labeling is the common way to obtain them. Because it is not feasible to manually label every URL in a large data set, we propose reducing manual intervention by labeling only 10% of the URLs by hand and using a semisupervised learning algorithm. We compare the proposed approach with a supervised learning approach. Evaluation results show that our proposal is competitive when it is combined with appropriate feature selection and undersampling techniques.
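To make the data setup concrete: a 1:654 ratio means that only about 0.15% of the URLs are phishing. The following is a minimal sketch of that setup, not the authors' procedure. The abstract does not say how the 10% was chosen or which undersampling technique was used, so uniform random selection and random undersampling are assumptions here, and the function names `pick_labeled_subset` and `undersample` are hypothetical. The semisupervised learner itself is not shown.

```python
import random

def pick_labeled_subset(n_items, fraction=0.10, seed=0):
    """Choose which items receive a manual label (assumed: uniform at random)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_items), int(n_items * fraction)))

def undersample(indices, labels, spam_per_phish=1.0, seed=0):
    """Keep every labeled phishing item (label 1) and a random subset of the
    labeled spam items (label 0), at most `spam_per_phish` per phishing item."""
    rng = random.Random(seed)
    phishing = [i for i in indices if labels[i] == 1]
    spam = [i for i in indices if labels[i] == 0]
    keep = min(len(spam), round(spam_per_phish * len(phishing)))
    return sorted(phishing + rng.sample(spam, keep))

# Toy data set with the 1:654 phishing-to-spam ratio from the abstract.
labels = [1] * 100 + [0] * 65400

# Only 10% of the URLs are labeled by hand; the rest stay unlabeled.
labeled = pick_labeled_subset(len(labels))
unlabeled = sorted(set(range(len(labels))) - set(labeled))

# Undersampling can only use labels that are actually known.
train = undersample(labeled, labels)

print(len(labeled), len(unlabeled), len(train))
```

In this sketch the balanced `train` set would seed a supervised or semisupervised learner, while `unlabeled` is the pool a semisupervised method would draw on.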