Evaluating a semisupervised approach to phishing URL identification in a realistic scenario

  • Authors:
  • Binod Gyawali;Thamar Solorio;Manuel Montes-y-Gómez;Bradley Wardman;Gary Warner

  • Affiliations:
  • University of Alabama at Birmingham, Birmingham, Alabama;University of Alabama at Birmingham, Birmingham, Alabama;University of Alabama at Birmingham, Birmingham, Alabama;University of Alabama at Birmingham, Birmingham, Alabama;University of Alabama at Birmingham, Birmingham, Alabama

  • Venue:
  • Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference
  • Year:
  • 2011

Abstract

Phishing sites have become a common way to steal sensitive information from internet users, such as usernames, passwords, and credit card details. We propose a semisupervised machine learning approach to detect phishing URLs in a set of phishing and spam URLs, all of which are collected from spam emails. In reality, far fewer phishing URLs than other URLs arrive through these emails. Our study therefore targets a realistic scenario: a highly imbalanced data set containing phishing and spam URLs in a 1:654 ratio. Training a learning algorithm requires labeled URLs, and these are commonly labeled by hand. Because it is not feasible to manually label every URL in a large data set, we propose reducing manual intervention by labeling only 10% of the URLs manually and using a semisupervised learning algorithm. We compare the proposed approach with a supervised learning approach. Evaluation results show that our proposal is competitive when combined with appropriate feature selection and undersampling techniques.
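To make the data setup in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It builds a toy pool with the 1:654 phishing-to-spam ratio, sets aside 10% of it as the manually labeled portion, and applies plain random undersampling to the labeled spam. The abstract does not say which undersampling technique, feature set, or learner the paper uses, so random undersampling, the function names (`make_urls`, `split_labeled`, `undersample`), the pool size, and the placeholder URL identifiers are all assumptions made for illustration.

```python
import random

def make_urls(n_phish, ratio=654, seed=0):
    """Build a toy pool of (url_id, label) pairs; label 1 = phishing, 0 = spam.

    The 1:ratio split mirrors the imbalance quoted in the abstract; the
    identifiers are placeholders, not real URLs.
    """
    rng = random.Random(seed)
    pool = [(f"phish-{i}", 1) for i in range(n_phish)]
    pool += [(f"spam-{i}", 0) for i in range(n_phish * ratio)]
    rng.shuffle(pool)
    return pool

def split_labeled(pool, fraction=0.10, seed=0):
    """Choose the fraction of URLs that would be labeled by hand.

    Returns the labeled pairs and the remaining URL ids with labels hidden.
    """
    rng = random.Random(seed)
    k = round(len(pool) * fraction)
    chosen = set(rng.sample(range(len(pool)), k))
    labeled = [item for i, item in enumerate(pool) if i in chosen]
    unlabeled = [url for i, (url, _) in enumerate(pool) if i not in chosen]
    return labeled, unlabeled

def undersample(labeled, spam_per_phish=1, seed=0):
    """Randomly drop spam so at most spam_per_phish remain per phishing URL.

    Random undersampling is an assumed stand-in for whatever technique
    the paper actually evaluates.
    """
    rng = random.Random(seed)
    phish = [x for x in labeled if x[1] == 1]
    spam = [x for x in labeled if x[1] == 0]
    keep = min(len(spam), spam_per_phish * len(phish))
    return phish + rng.sample(spam, keep)

pool = make_urls(n_phish=100)            # 100 phishing + 65,400 spam URLs
labeled, unlabeled = split_labeled(pool)  # 10% labeled, 90% unlabeled
balanced = undersample(labeled)           # balanced labeled training set
print(len(pool), len(labeled), len(unlabeled), len(balanced))
```

In a semisupervised setup of the kind the abstract describes, a classifier trained on `balanced` would then be used to assign labels to part of `unlabeled`, and the process would repeat. That loop is omitted here because the abstract does not specify the algorithm.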