Near-Duplicate mail detection based on URL information for spam filtering

Authors:
Chun-Chao Yeh;Chia-Hui Lin
Affiliations:
Department of Computer Science, National Taiwan Ocean University, Taiwan;Department of Computer Science, National Taiwan Ocean University, Taiwan
Venue:
ICOIN'06 Proceedings of the 2006 international conference on Information Networking: advances in Data Communications and Wireless Networks
Year:
2006

References (7)

Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS)
On computer system challenges. Journal of the ACM (JACM)
Spam wars. Communications of the ACM - Program compaction
Winnowing: local algorithms for document fingerprinting. Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Support vector machine active learning with applications to text classification. The Journal of Machine Learning Research
Boosting trees for clause splitting. ConLL '01 Proceedings of the 2001 workshop on Computational Natural Language Learning - Volume 7
Support vector machines for spam categorization. IEEE Transactions on Neural Networks

Cited by (1)

Detecting near-duplicate SPITs in voice mailboxes using hashes. ISC'11 Proceedings of the 14th international conference on Information security

Abstract

Because spam techniques change quickly to evade detection, we argue that multiple spam detection strategies should be developed to combat spam effectively. In the literature, many proposed spam detection schemes use similar strategies based on supervised classification techniques such as naive Bayes, SVM, and k-NN, but only a few works rely on detecting duplicate copies. In this paper, we propose a new duplicate-mail detection scheme based on the similarity of mail context between incoming mails, especially the context of URL information. We discuss different design strategies to counter the tricks spammers may use to avoid detection. We also compare our approach with four approaches available in the literature: the octet-based histogram method, I-Match, Winnowing, and identical matching. With thousands of real mails that we collected as test data, our experimental results show that the proposed strategy outperforms the others. Excluding compulsory misses, over 97% of near-duplicate mails are detected correctly.
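The abstract does not spell out how URL information is compared, so the following Python sketch is only an illustration of the general idea, not the authors' scheme. It assumes that each mail is reduced to the set of hostnames in its URLs and that two mails count as near-duplicates when the Jaccard similarity of those sets reaches a threshold. The names `url_features`, `jaccard`, and `is_near_duplicate`, the regular expression, and the threshold of 0.5 are all hypothetical choices made for this example.

```python
import re
from urllib.parse import urlsplit

# Assumed, deliberately simple pattern for http/https URLs in a mail body.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def url_features(body: str) -> frozenset:
    """Return the set of lowercased hostnames of the URLs found in a mail body."""
    hosts = set()
    for candidate in URL_RE.findall(body):
        try:
            host = urlsplit(candidate).hostname
        except ValueError:
            # Malformed URL (for example an unbalanced IPv6 bracket): skip it.
            continue
        if host:
            # Drop a trailing dot picked up from sentence punctuation.
            hosts.add(host.rstrip("."))
    return frozenset(hosts)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(body_a: str, body_b: str, threshold: float = 0.5) -> bool:
    """Treat two mails as near-duplicates if their URL host sets overlap enough."""
    return jaccard(url_features(body_a), url_features(body_b)) >= threshold
```

In this sketch, two mails whose text has been obfuscated differently but that advertise the same site still map to the same feature set, which is the intuition behind using URL information. Mails without any URL are never matched, since an empty feature set carries no evidence either way.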