Effective spam filtering: A single-class learning and ensemble approach

Authors:
Chih-Ping Wei; Hsueh-Ching Chen; Tsang-Hsiang Cheng
Affiliations:
Institute of Technology Management, College of Technology Management, National Tsing Hua University, Hsinchu, Taiwan, ROC; Allion Computer Inc., No. 14, Lane 160, Fu Yang St., Taipei, Taiwan, ROC; Department of Business Administration, Southern Taiwan University, Tainan, Taiwan, ROC
Venue:
Decision Support Systems
Year:
2008

Abstract

Spam email increasingly plagues both individuals and organizations. In response, most prior research treats spam filtering as a classical text categorization task, in which training examples must include both spam (positive examples) and legitimate emails (negative examples). However, in many spam filtering scenarios, obtaining legitimate emails for training purposes can be more difficult than collecting spam and unclassified emails. Hence, it is more appropriate to construct a classification model for spam filtering that uses only positive training examples (i.e., spam) and unlabeled instances, and does not require legitimate emails as negative training examples. Several single-class learning techniques, such as PNB and PEBL, have been proposed in the literature. However, they have inherent limitations with regard to spam filtering. In this study, we propose and develop an ensemble approach, referred to as E2, to address these limitations. Specifically, we follow the two-stage framework of PEBL but extend each stage with an ensemble strategy. Empirical evaluation results on two spam filtering corpora suggest that our proposed E2 technique generally outperforms the benchmark techniques (i.e., PNB and PEBL) and exhibits more stable performance than its counterparts.
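
The two-stage idea mentioned in the abstract, learning from spam and unlabeled emails only, can be illustrated with a minimal sketch. This is not the authors' E2 technique and not a faithful PEBL implementation: the abstract gives no algorithmic detail, so everything below is an assumption made for illustration. Stage 1 picks "strong negatives" from the unlabeled pool (emails containing no word that is markedly more frequent in spam), and stage 2 repeatedly trains a classifier and moves unlabeled emails it rejects into the negative set. A tiny naive Bayes scorer stands in for whatever base learner one would really use, and the names `strong_negatives`, `train_nb`, `two_stage_pu`, and the `ratio` threshold are all hypothetical.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercased set of whitespace-separated words in one email."""
    return set(text.lower().split())


def strong_negatives(positives, unlabeled, ratio=2.0):
    """Stage 1 (assumed form): unlabeled docs with no 'positive feature'.

    A word counts as a positive feature when its document frequency among
    the positives exceeds `ratio` times its frequency among the unlabeled.
    """
    pos_df = Counter(w for d in positives for w in tokens(d))
    unl_df = Counter(w for d in unlabeled for w in tokens(d))
    feats = {w for w, c in pos_df.items()
             if c / len(positives) > ratio * unl_df[w] / len(unlabeled)}
    negs = [d for d in unlabeled if not (tokens(d) & feats)]
    return negs, feats


def train_nb(pos_docs, neg_docs):
    """Tiny naive Bayes stand-in with add-one smoothing and no class prior."""
    vocab = set().union(*(tokens(d) for d in pos_docs + neg_docs))

    def log_probs(docs):
        counts = Counter(w for d in docs for w in tokens(d))
        total = sum(counts.values()) + len(vocab)
        return {w: math.log((counts[w] + 1) / total) for w in vocab}

    lp, ln = log_probs(pos_docs), log_probs(neg_docs)

    def is_positive(doc):
        # Words never seen in training are ignored.
        return sum(lp[w] - ln[w] for w in tokens(doc) if w in vocab) > 0

    return is_positive


def two_stage_pu(positives, unlabeled):
    """Stage 2 (assumed form): grow the negative set until it stops changing."""
    negatives, _ = strong_negatives(positives, unlabeled)
    if not negatives:
        raise ValueError("stage 1 found no strong negatives")
    remaining = [d for d in unlabeled if d not in negatives]
    while True:
        clf = train_nb(positives, negatives)
        new_negs = [d for d in remaining if not clf(d)]
        if not new_negs:
            return clf
        negatives = negatives + new_negs
        remaining = [d for d in remaining if clf(d)]


# Toy data, invented purely for this illustration.
SPAM = ["win free prize now", "free money win big", "claim free prize money"]
UNLABELED = ["meeting agenda for monday", "project report attached",
             "win free money today", "lunch on monday",
             "free prize claim now", "see report before meeting"]

spam_filter = two_stage_pu(SPAM, UNLABELED)
```

Calling `spam_filter(text)` returns `True` when the sketch scores `text` as spam. An ensemble variant in the spirit of the abstract would combine several such models at each stage, but how E2 does so is not specified here.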