Collection statistics for fast duplicate document detection
ACM Transactions on Information Systems (TOIS)
Journal of the ACM (JACM)
Communications of the ACM - Program compaction
Winnowing: local algorithms for document fingerprinting
Proceedings of the 2003 ACM SIGMOD international conference on Management of data
Support vector machine active learning with applications to text classification
The Journal of Machine Learning Research
Boosting trees for clause splitting
ConLL '01 Proceedings of the 2001 workshop on Computational Natural Language Learning - Volume 7
Support vector machines for spam categorization
IEEE Transactions on Neural Networks
Detecting near-duplicate SPITs in voice mailboxes using hashes
ISC'11 Proceedings of the 14th international conference on Information security
Because spam techniques change quickly to evade detection, we argue that multiple spam detection strategies should be developed to combat spam effectively. In the literature, many proposed spam detection schemes use similar strategies based on supervised classification techniques such as naive Bayesian, SVM, and k-NN, but only a few works use a strategy based on detecting duplicate copies. In this paper, we propose a new duplicate-mail detection scheme based on the similarity of mail context between incoming mails, especially the context of URL information. We discuss different design strategies to counter possible spam tricks for avoiding detection. We also compare our approach with four approaches available in the literature: the octet-based histogram method, I-Match, Winnowing, and identical matching. With thousands of real mails collected as test data, our experimental results show that the proposed strategy outperforms the others. Without considering compulsory misses, over 97% of near-duplicate mails can be detected correctly.
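The abstract does not specify how URL context is compared, so the following is only a minimal sketch of the general idea of URL-based near-duplicate detection, not the scheme proposed in the paper. It assumes that two mails count as near duplicates when the Jaccard similarity of their URL host sets reaches a threshold. The function names, the URL pattern, and the threshold of 0.5 are all hypothetical choices made for illustration.

```python
import re
from urllib.parse import urlsplit

# Hypothetical, deliberately simple URL pattern (an assumption, not from the paper).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def url_hosts(mail_body):
    """Return the set of lower-cased host names of the URLs found in a mail body."""
    hosts = set()
    for candidate in URL_PATTERN.findall(mail_body):
        try:
            host = urlsplit(candidate).hostname
        except ValueError:
            # Malformed URL (for example, an unbalanced IPv6 bracket): skip it.
            continue
        if host:
            hosts.add(host.lower())
    return frozenset(hosts)


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(mail_a, mail_b, threshold=0.5):
    """Flag two mails as near duplicates when their URL host sets overlap enough.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return jaccard(url_hosts(mail_a), url_hosts(mail_b)) >= threshold


if __name__ == "__main__":
    spam_1 = "Cheap pills at http://shop.example.com/buy?id=1 now"
    spam_2 = "Ch3ap p1lls!! visit HTTP://Shop.Example.com/buy?id=99"
    ham = "Meeting notes are at https://intranet.example.org/notes"
    print(is_near_duplicate(spam_1, spam_2))
    print(is_near_duplicate(spam_1, ham))
```

Comparing hosts rather than full URLs is one possible way to tolerate per-recipient query strings and obfuscated body text. Mails that contain no URLs are never matched under this sketch and would need a different strategy.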