A scalable intelligent non-content-based spam-filtering framework

  • Authors:
  • Yong Hu; Ce Guo; E. W. T. Ngai; Mei Liu; Shifeng Chen

  • Affiliations:
  • Institute of Business Intelligence & Knowledge Discovery, Department of E-commerce, Guangdong University of Foreign Studies, Sun Yat-Sen University, Guangzhou 510006, PR China; Institute of Business Intelligence & Knowledge Discovery, Department of E-commerce, Guangdong University of Foreign Studies, Sun Yat-Sen University, Guangzhou 510006, PR China; Department of Management and Marketing, The Hong Kong Polytechnic University, Kowloon, Hong Kong, PR China; Department of Biomedical Informatics, Vanderbilt University, Nashville, TN 37232, USA; Hanqi Network Technology Co., Ltd., Guangzhou 510665, PR China

  • Venue:
  • Expert Systems with Applications: An International Journal
  • Year:
  • 2010

Quantified Score

Hi-index 12.05

Abstract

Designing a spam-filtering system that runs efficiently on heavily burdened servers is particularly important for widely used email service providers (ESPs) (e.g., Hotmail, Yahoo, and Gmail), which must handle millions of emails every day. The two primary challenges these companies face in spam filtering are efficiency and scalability. This study develops an efficient and scalable spam-filtering framework for heavily burdened email servers. We propose an Intelligent Hybrid Spam-Filtering Framework (IHSFF) that detects spam by analyzing only email headers. Because of its efficiency and scalability, the framework is especially suitable for very large email servers. The proposed filtering system may be deployed alone or in conjunction with other filters. We extract five features from the email header: "originator field", "destination field", "X-Mailer field", "sender server IP address", and "mail subject". Email subjects are digitized using an n-gram-based algorithm for better performance. Using real-world data from a well-known ESP in China, we test the model with various machine-learning algorithms. Experimental results show that the framework using the Random Forest algorithm achieves good accuracy, recall, precision, and F-measure. With the addition of the MetaCost framework, the model performs stably and incurs small costs in various cost-sensitive scenarios.
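
As an illustration of the header-only approach described in the abstract, the following is a minimal Python sketch of extracting the five header features and turning the subject into numbers with character n-grams. It is not the authors' implementation: the abstract does not specify their n-gram algorithm, so the hashed bigram counts used here are an assumption, as are the function names (`extract_header_features`, `subject_vector`), the bucket count, the choice of the topmost `Received` header as the source of the sender server IP, and the sample message.

```python
import re
import zlib
from email import message_from_string

# Assumption: the sender server IP appears in square brackets in a Received header.
IP_PATTERN = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def subject_vector(subject, n=2, buckets=16):
    """Count the subject's character n-grams into a fixed number of buckets.

    Hashing n-grams into buckets is an illustrative choice, not the
    algorithm used in the paper.
    """
    vec = [0] * buckets
    for gram in char_ngrams(subject.lower(), n):
        vec[zlib.crc32(gram.encode("utf-8")) % buckets] += 1
    return vec


def extract_header_features(raw_email):
    """Extract the five header features named in the abstract; the body is not used."""
    msg = message_from_string(raw_email)
    received = msg.get_all("Received") or []
    # Assumption: the topmost Received header was added by the receiving
    # server and records the IP of the server that connected to it.
    match = IP_PATTERN.search(received[0]) if received else None
    return {
        "originator": msg.get("From", ""),
        "destination": msg.get("To", ""),
        "x_mailer": msg.get("X-Mailer", ""),
        "sender_ip": match.group(1) if match else "",
        "subject_vector": subject_vector(msg.get("Subject", "")),
    }


# Hypothetical sample message, for illustration only.
SAMPLE_EMAIL = (
    "Received: from mail.example.org (mail.example.org [192.0.2.10])\n"
    "\tby mx.example.com; Mon, 1 Mar 2010 10:00:00 +0800\n"
    "From: alice@example.org\n"
    "To: bob@example.com\n"
    "X-Mailer: ExampleMailer 1.0\n"
    "Subject: Cheap offer\n"
    "\n"
    "The body is ignored by a header-only filter.\n"
)

features = extract_header_features(SAMPLE_EMAIL)
```

The resulting dictionary could then be encoded and passed to a classifier such as Random Forest. Because only headers are parsed, the cost per message does not grow with body size, which is the efficiency argument the abstract makes.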