An incremental cluster-based approach to spam filtering

Authors:
Wen-Feng Hsiao;Te-Min Chang
Affiliations:
Department of Information Management, National Pingtung Institute of Commerce, Taiwan;Department of Information Management, National Sun Yat-sen University, Taiwan
Venue:
Expert Systems with Applications: An International Journal
Year:
2008

Abstract

As email has become a popular means of communication over the Internet, the problem of receiving unsolicited and undesired emails, called spam or junk mail, has become severe. To filter spam from legitimate emails, automatic classification approaches using text mining techniques have been proposed. Such approaches, however, often suffer from a low recall rate because of two characteristics of spam: skewed class distributions and concept drift. This research therefore proposes a classification approach that alleviates the problems of skewed class distributions and drifting concepts. A cluster-based classification method, called ICBC, is developed accordingly. ICBC consists of two phases. In the first phase, it clusters the emails in each given class into several groups and extracts an equal number of features (keywords) from each group, so that the features of the minority class are adequately represented. In the second phase, ICBC is equipped with an incremental learning mechanism that adapts to changes in the environment quickly and at low cost. Three experiments are conducted to evaluate the performance of ICBC. The results show that ICBC can effectively deal with skewed and changing class distributions, and that its incremental learning reduces the cost of re-training. The feasibility of the proposed approach is thus justified.
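
The two phases described in the abstract can be illustrated with a small sketch. This is not the authors' ICBC implementation: the abstract does not specify the clustering algorithm, the similarity measure, or the feature-selection criterion. The sketch assumes a simple leader-style clustering, cosine similarity over raw term counts, and "top-k most frequent terms" as the per-cluster keywords. The class name `ClusterSketch` and the parameters `threshold` and `top_k` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class ClusterSketch:
    """Illustrative cluster-based filter (assumed design, not ICBC itself)."""

    def __init__(self, threshold=0.3, top_k=3):
        self.threshold = threshold  # assumed minimum similarity to join a cluster
        self.top_k = top_k          # same number of keywords kept per cluster
        self.clusters = []          # list of (label, summed term counts)

    def learn(self, text, label):
        """Incrementally absorb one labeled email without re-training."""
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for lbl, centroid in self.clusters:
            if lbl != label:
                continue  # clusters are formed within each class separately
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = centroid, sim
        if best is not None and best_sim >= self.threshold:
            best.update(vec)                    # merge into the nearest cluster
        else:
            self.clusters.append((label, vec))  # start a new cluster

    def keywords(self):
        """Equal number of keywords from every cluster, whatever its size."""
        return [(lbl, [t for t, _ in c.most_common(self.top_k)])
                for lbl, c in self.clusters]

    def predict(self, text):
        """Label of the cluster whose keyword profile is most similar."""
        vec = Counter(text.lower().split())
        best_label, best_sim = None, -1.0
        for lbl, centroid in self.clusters:
            profile = Counter(dict(centroid.most_common(self.top_k)))
            sim = cosine(vec, profile)
            if sim > best_sim:
                best_label, best_sim = lbl, sim
        return best_label


if __name__ == "__main__":
    f = ClusterSketch()
    f.learn("cheap pills buy cheap pills now", "spam")
    f.learn("meeting agenda for project meeting", "ham")
    print(f.keywords())
    print(f.predict("buy cheap pills"))
```

Because every cluster contributes the same number of keywords, a small class is not drowned out by a large one, and a new email only touches one cluster, which is the sense in which the update is cheap in this sketch.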