Effective spam filtering: A single-class learning and ensemble approach

  • Authors:
  • Chih-Ping Wei; Hsueh-Ching Chen; Tsang-Hsiang Cheng

  • Affiliations:
  • Institute of Technology Management, College of Technology Management, National Tsing Hua University, Hsinchu, Taiwan, ROC; Allion Computer Inc., No. 14, Lane 160, Fu Yang St., Taipei, Taiwan, ROC; Department of Business Administration, Southern Taiwan University, Tainan, Taiwan, ROC

  • Venue:
  • Decision Support Systems
  • Year:
  • 2008

Abstract

The annoyance of spam emails increasingly plagues both individuals and organizations. In response, most prior research treats spam filtering as a classical text categorization task, in which training examples must include both spam (positive examples) and legitimate emails (negative examples). However, in many spam filtering scenarios, obtaining legitimate emails for training purposes can be more difficult than collecting spam and unclassified emails. Hence, it is more appropriate to construct a classification model for spam filtering that uses only positive training examples (i.e., spam) and unlabeled instances, and does not require legitimate emails as negative training examples. Several single-class learning techniques, such as PNB and PEBL, have been proposed in the literature. However, they have inherent limitations with regard to spam filtering. In this study, we propose and develop an ensemble approach, referred to as E2, to address these limitations. Specifically, we follow the two-stage framework of PEBL but extend each stage with an ensemble strategy. The empirical evaluation results from two spam filtering corpora suggest that our proposed E2 technique generally outperforms the benchmark techniques (i.e., PNB and PEBL) and exhibits more stable performance than its counterparts.
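
To make the two-stage idea concrete, the following is a minimal sketch of learning from spam and unlabeled emails only, loosely in the spirit of the PEBL framework mentioned above. It is not the authors' E2 technique, whose ensemble extensions the abstract does not detail. Stage 1 picks out "strong negatives", meaning unlabeled emails that contain none of the words that are relatively more frequent in spam. Stage 2 repeatedly trains a classifier on spam versus the current negatives and moves newly rejected emails into the negative set. As an assumption made for brevity, a simple centroid-based scorer stands in for the SVM that PEBL is usually described with, and all function names and the toy emails are hypothetical.

```python
from collections import Counter

def strong_positive_features(positives, unlabeled):
    """Stage 1a: words whose document frequency is higher in spam than in unlabeled mail."""
    pos_df = Counter(w for doc in positives for w in set(doc))
    unl_df = Counter(w for doc in unlabeled for w in set(doc))
    return {
        w for w, n in pos_df.items()
        if n / len(positives) > unl_df[w] / len(unlabeled)
    }

def centroid(docs):
    """Average bag-of-words vector of a list of tokenized documents."""
    total = Counter()
    for doc in docs:
        total.update(doc)
    return {w: c / len(docs) for w, c in total.items()}

def score(doc, cent):
    """Dot product between a document's word counts and a centroid."""
    return sum(cent.get(w, 0.0) for w in doc)

def two_stage_pu(positives, unlabeled):
    """Return (negatives, remaining); `remaining` are unlabeled emails still treated as spam."""
    feats = strong_positive_features(positives, unlabeled)
    # Stage 1b: strong negatives contain no strong positive feature.
    negatives = [d for d in unlabeled if not feats & set(d)]
    remaining = [d for d in unlabeled if feats & set(d)]
    # Stage 2: retrain and grow the negative set until it stops changing.
    # Assumption: a centroid comparison replaces the SVM used in PEBL.
    while negatives and remaining:
        pos_c, neg_c = centroid(positives), centroid(negatives)
        new_neg = [d for d in remaining if score(d, neg_c) > score(d, pos_c)]
        if not new_neg:
            break
        negatives.extend(new_neg)
        remaining = [d for d in remaining if d not in new_neg]
    return negatives, remaining

# Hypothetical toy data: each email is a list of tokens.
spam = [["win", "cash", "now"], ["cheap", "pills", "now"], ["win", "prize", "cash"]]
unlabeled = [
    ["meeting", "agenda", "monday"],
    ["lunch", "monday"],
    ["win", "cash", "prize"],
    ["agenda", "now", "meeting"],
]
negatives, likely_spam = two_stage_pu(spam, unlabeled)
print(likely_spam)
```

On this toy input, the email containing "now" alongside meeting vocabulary is not a strong negative in stage 1 but is absorbed into the negative set during stage 2, which illustrates why the iterative second stage matters.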