A scalable intelligent non-content-based spam-filtering framework

Authors:
Yong Hu;Ce Guo;E. W. T. Ngai;Mei Liu;Shifeng Chen
Affiliations:
Institute of Business Intelligence & Knowledge Discovery, Department of E-commerce, Guangdong University of Foreign Studies, Sun Yat-Sen University, Guangzhou 510006, PR China;Institute of Business Intelligence & Knowledge Discovery, Department of E-commerce, Guangdong University of Foreign Studies, Sun Yat-Sen University, Guangzhou 510006, PR China;Department of Management and Marketing, The Hong Kong Polytechnic University, Kowloon, Hong Kong, PR China;Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37232, USA;Hanqi Network Technology Co., Ltd., Guangzhou 510665, PR China
Venue:
Expert Systems with Applications: An International Journal
Year:
2010

References (15):
MetaCost: a general method for making classifiers cost-sensitive. KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining.
An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval.
Random Forests. Machine Learning.
A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists. Information Retrieval.
A Comparative Study of Classification Based Personal E-mail Filtering. PADKK '00 Proceedings of the 4th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Current Issues and New Applications.
Identifying spam without peeking at the contents. Crossroads.
Applying lazy learning algorithms to tackle concept drift in spam filtering. Expert Systems with Applications: An International Journal.
Learning to classify e-mail. Information Sciences: an International Journal.
Artificial immune system inspired behavior-based anti-spam filter. Soft Computing - A Fusion of Foundations, Methodologies and Applications - Web intelligence and change discovery.
An empirical study of three machine learning methods for spam filtering. Knowledge-Based Systems.
Workload models of spam and legitimate e-mails. Performance Evaluation.
On the properties of spam-advertised URL addresses. Journal of Network and Computer Applications.
A mailbox ownership based mechanism for curbing spam. Computer Communications.
Behavior-based spam detection using a hybrid method of rule-based techniques and neural networks. Expert Systems with Applications: An International Journal.
Support vector machines for spam categorization. IEEE Transactions on Neural Networks.

Cited by (4):
Segmental parameterisation and statistical modelling of e-mail headers for spam detection. Information Sciences: an International Journal.
Grindstone4Spam: An optimization toolkit for boosting e-mail classification. Journal of Systems and Software.
Identifying spam e-mail based-on statistical header features and sender behavior. Proceedings of the CUBE International Information Technology Conference.
Hybrid email spam detection model with negative selection algorithm and differential evolution. Engineering Applications of Artificial Intelligence.

Abstract

Designing a spam-filtering system that runs efficiently on heavily loaded servers is particularly important to widely used email service providers (ESPs) such as Hotmail, Yahoo, and Gmail, which must process millions of emails every day. The two primary challenges these companies face in spam filtering are efficiency and scalability. This study develops an efficient and scalable spam-filtering framework for heavily loaded email servers. We propose an Intelligent Hybrid Spam-Filtering Framework (IHSFF) that detects spam by analyzing only email headers. This framework is especially suitable for very large email servers because of its efficiency and scalability. The proposed filtering system may be deployed alone or in conjunction with other filters. We extract five features from the email header: the "originator field", "destination field", "X-Mailer field", "sender server IP address", and "mail subject". Email subjects are converted to numeric form using an n-gram-based algorithm for better performance. Using real-world data from a well-known ESP in China, we test the model with various machine-learning algorithms. Experimental results show that the framework using the Random Forest algorithm achieves good accuracy, recall, precision, and F-measure. With the addition of the MetaCost framework, the model performs stably and incurs small costs in various cost-sensitive scenarios.
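
To make the header-only idea concrete, the following is a minimal Python sketch of pulling the five header features named in the abstract out of a raw message. It is not the authors' implementation: the abstract does not specify the n-gram algorithm, so the character trigrams hashed into a fixed number of buckets are an assumption, as are the function names (`char_ngrams`, `subject_vector`, `first_received_ip`, `header_features`), the bucket count, and the choice to read the sender IP from the first `Received` header that carries a bracketed IPv4 address.

```python
import re
import zlib
from email import message_from_string
from email.utils import parseaddr


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def subject_vector(subject, n=3, buckets=16):
    """Hash each n-gram into a fixed-size count vector (assumed encoding)."""
    vec = [0] * buckets
    for gram in char_ngrams(subject, n):
        vec[zlib.crc32(gram.encode("utf-8")) % buckets] += 1
    return vec


def first_received_ip(msg):
    """Return the bracketed IPv4 address from the first Received header that has one."""
    for received in msg.get_all("Received", []):
        match = re.search(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]", received)
        if match:
            return match.group(1)
    return None


def header_features(raw):
    """Extract the five header features; the message body is never inspected."""
    msg = message_from_string(raw)
    return {
        "originator": parseaddr(msg.get("From", ""))[1],
        "destination": parseaddr(msg.get("To", ""))[1],
        "x_mailer": msg.get("X-Mailer", ""),
        "sender_ip": first_received_ip(msg),
        "subject_ngrams": subject_vector(msg.get("Subject", "")),
    }


# Hypothetical message used only for illustration.
SAMPLE = (
    "Received: from mail.example.org (mail.example.org [192.0.2.10]) by mx.example.com\n"
    "From: Alice <alice@example.org>\n"
    "To: Bob <bob@example.com>\n"
    "X-Mailer: ExampleMailer 1.0\n"
    "Subject: Cheap meds now\n"
    "\n"
    "Body text that the filter does not read.\n"
)

features = header_features(SAMPLE)
```

The resulting dictionary could then be encoded and passed to any classifier; the abstract reports Random Forest results, optionally wrapped in MetaCost for cost-sensitive settings, but the feature encoding used there is not described here.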