Near-Duplicate mail detection based on URL information for spam filtering

  • Authors:
  • Chun-Chao Yeh; Chia-Hui Lin

  • Affiliations:
  • Department of Computer Science, National Taiwan Ocean University, Taiwan; Department of Computer Science, National Taiwan Ocean University, Taiwan

  • Venue:
  • ICOIN'06 Proceedings of the 2006 international conference on Information Networking: advances in Data Communications and Wireless Networks
  • Year:
  • 2006

Abstract

Because spam techniques change quickly to evade detection, we argue that multiple spam detection strategies should be developed to combat spam effectively. In the literature, many proposed spam detection schemes use similar strategies based on supervised classification techniques such as naive Bayesian, SVM, and K-NN, but only a few works address the detection of duplicate copies. In this paper, we propose a new duplicate-mail detection scheme based on the similarity of mail context between incoming mails, especially the context of URL information. We discuss different design strategies to counter the tricks spammers may use to avoid detection. We also compare our approach with four approaches available in the literature: the octet-based histogram method, I-Match, Winnowing, and identical matching. Using thousands of real mails that we collected as test data, our experimental results show that the proposed strategy outperforms the others. Excluding compulsory misses, over 97% of near-duplicate mails are detected correctly.
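
The abstract does not give the details of the scheme, so the following is only a minimal sketch of the general idea of comparing mails by their URL information. It is not the authors' algorithm. The host-name feature, the Jaccard similarity measure, the 0.5 threshold, and the function names `url_features`, `jaccard`, and `is_near_duplicate` are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumption: a simple pattern for http/https URLs in plain-text mail bodies.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')]+", re.IGNORECASE)


def url_features(mail_body: str) -> frozenset:
    """Return the set of host names of the URLs found in a mail body.

    Assumption: only the host name is kept, so that changes to the path or
    query string (a common spam trick) do not alter the feature set.
    """
    hosts = set()
    for url in URL_PATTERN.findall(mail_body):
        try:
            host = urlsplit(url).hostname  # already lowercased by urlsplit
        except ValueError:
            continue  # skip malformed URLs instead of failing
        if host:
            hosts.add(host)
    return frozenset(hosts)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two sets; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(mail_a: str, mail_b: str, threshold: float = 0.5) -> bool:
    """Flag two mails as near-duplicates if their URL host sets overlap enough.

    Assumption: the threshold value is illustrative, not taken from the paper.
    """
    return jaccard(url_features(mail_a), url_features(mail_b)) >= threshold
```

Under these assumptions, two mails with different wording and different query strings but the same advertised host are treated as near-duplicates, while mails that contain no URLs are never matched to each other.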