Phishing sites have become a common means of stealing sensitive information from internet users, such as usernames, passwords, and credit card details. We propose a semisupervised machine learning approach to detect phishing URLs in a set of phishing and spam URLs collected from spam emails. In practice, phishing URLs make up only a small share of the URLs received through spam. Our study therefore targets a realistic scenario: a highly imbalanced data set in which phishing and spam URLs occur at a ratio of 1:654. Training a learning algorithm requires labeled URLs, and manual labeling is the common way to obtain them. Because manually labeling every URL in a large data set is not feasible, we propose reducing manual intervention by labeling only 10% of the URLs by hand and using a semisupervised learning algorithm. We compare the proposed approach with a supervised learning approach. Evaluation results show that our proposal is competitive when combined with appropriate feature selection and undersampling techniques.
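The two data-preparation ideas mentioned above, labeling only 10% of the URLs and undersampling the majority class, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mask_labels` and `undersample`, the synthetic URLs, the 1:1 target ratio, and the choice of plain random undersampling are all assumptions made for illustration, since the abstract does not specify them.

```python
import random


def mask_labels(samples, labeled_fraction=0.10, seed=0):
    """Keep the label of a random fraction of samples; set the rest to None.

    Hypothetical helper: mimics manually labeling only part of the data.
    """
    rng = random.Random(seed)
    n_labeled = int(round(labeled_fraction * len(samples)))
    labeled_idx = set(rng.sample(range(len(samples)), n_labeled))
    return [(x, y if i in labeled_idx else None)
            for i, (x, y) in enumerate(samples)]


def undersample(samples, minority_label, ratio=1.0, seed=0):
    """Keep every minority sample and a random subset of the majority.

    `ratio` is the number of majority samples kept per minority sample.
    Hypothetical helper: plain random undersampling, assumed for illustration.
    """
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == minority_label]
    majority = [s for s in samples if s[1] != minority_label]
    keep = min(len(majority), int(round(ratio * len(minority))))
    return minority + rng.sample(majority, keep)


# Synthetic stand-in data at the 1:654 phishing-to-spam ratio from the abstract.
data = ([("phish-%d" % i, 1) for i in range(10)]
        + [("spam-%d" % i, 0) for i in range(6540)])

partially_labeled = mask_labels(data, labeled_fraction=0.10)
balanced = undersample(data, minority_label=1, ratio=1.0)
```

Here `partially_labeled` keeps labels for roughly 10% of the URLs, as a semisupervised learner would receive them, while `balanced` keeps all phishing URLs and an equal number of randomly chosen spam URLs. How the two steps are ordered and combined in the actual study is not stated in the abstract.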