Catching the Drift: Using Feature-Free Case-Based Reasoning for Spam Filtering

Authors: Sarah Jane Delany; Derek Bridge
Affiliations: Dublin Institute of Technology, Dublin, Ireland; University College Cork, Cork, Ireland
Venue: ICCBR '07 Proceedings of the 7th international conference on Case-Based Reasoning: Case-Based Reasoning Research and Development
Year: 2007

Citing 26
Cited 4

Instance-Based Learning Algorithms
  Machine Learning
Generalizing from case studies: a case study
  ML92 Proceedings of the ninth international workshop on Machine learning
C4.5: programs for machine learning
  C4.5: programs for machine learning
Learning in the presence of concept drift and hidden contexts
  Machine Learning
Tolerating Concept and Sampling Shift in Lazy Learning Using Prediction Error Context Switching
  Artificial Intelligence Review - Special issue on lazy learning
Reduction Techniques for Instance-Based Learning Algorithms
  Machine Learning
Advances in Instance Selection for Instance-Based Learning Algorithms
  Data Mining and Knowledge Discovery
The similarity metric
  SODA '03 Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms
Adapting to Drift in Continuous Domains (Extended Abstract)
  ECML '95 Proceedings of the 8th European Conference on Machine Learning
Detecting Concept Drift with Support Vector Machines
  ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
Using k-d Trees to Improve the Retrieval Step in Case-Based Reasoning
  EWCBR '93 Selected papers from the First European Workshop on Topics in Case-Based Reasoning
Fish and Shrink. A Next Step Towards Efficient Case Retrieval in Large-Scale Case Bases
  EWCBR '96 Proceedings of the Third European Workshop on Advances in Case-Based Reasoning
Competence-Guided Case-Base Editing Techniques
  EWCBR '00 Proceedings of the 5th European Workshop on Advances in Case-Based Reasoning
Diagnosis and Decision Support
  Case-Based Reasoning Technology, From Foundations to Applications
Text Categorization Using Compression Models
  DCC '00 Proceedings of the Conference on Data Compression
DNA Sequence Classification Using Compression-Based Induction
  DNA Sequence Classification Using Compression-Based Induction
Towards parameter-free data mining
  Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
An Assessment of Case-Based Reasoning for Spam Filtering
  Artificial Intelligence Review
Spam Filtering Using Statistical Data Compression Models
  The Journal of Machine Learning Research
Learning drifting concepts: Example selection vs. example weighting
  Intelligent Data Analysis
Textual case-based reasoning for spam filtering: a comparison of feature-based and feature-free approaches
  Artificial Intelligence Review
ECUE: A Spam Filter that Uses Machine Learning to Track Concept Drift
  Proceedings of the 2006 conference on ECAI 2006: 17th European Conference on Artificial Intelligence, August 29 - September 1, 2006, Riva del Garda, Italy
A case-based technique for tracking concept drift in spam filtering
  Knowledge-Based Systems
Tracking concept drift at feature selection stage in spamhunting: an anti-spam instance-based reasoning system
  ECCBR'06 Proceedings of the 8th European conference on Advances in Case-Based Reasoning
Clustering by compression
  IEEE Transactions on Information Theory
Support vector machines for spam categorization
  IEEE Transactions on Neural Networks

Reexamination of CBR hypothesis
  ICCBR'10 Proceedings of the 18th international conference on Case-Based Reasoning Research and Development
Spam filtering using semantic similarity approach and adaptive BPNN
  Neurocomputing
Effective scheduling strategies for boosting performance on rule-based spam filtering frameworks
  Journal of Systems and Software
Automatic case acquisition from texts for process-oriented case-based reasoning
  Information Systems

Abstract

In this paper, we compare case-based spam filters, focusing on their resilience to concept drift. In particular, we evaluate how to track concept drift using a case-based spam filter that uses a feature-free distance measure based on text compression. In our experiments, we compare two ways to normalise such a distance measure, finding that the one proposed in [1] performs better. We show that a policy as simple as retaining misclassified examples has a hugely beneficial effect on handling concept drift in spam but, on its own, it results in the case base growing by over 30%. We then compare two different retention policies and two different forgetting policies (one a form of instance selection, the other a form of instance weighting) and find that they perform roughly as well as each other while keeping the case base size constant. Finally, we compare a feature-based textual case-based spam filter with our feature-free approach. In the face of concept drift, the feature-based approach requires the case base to be rebuilt periodically so that we can select a new feature set that better predicts the target concept. We find feature-free approaches to have lower error rates than their feature-based equivalents.
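
To make the idea of a feature-free, compression-based distance more concrete, here is a minimal Python sketch. The abstract does not name the two normalisations it compares, so the choice below is an assumption: it shows two well-known ones from the works listed among the references, the normalised compression distance (NCD, as in "The similarity metric" and "Clustering by compression") and the compression-based dissimilarity measure (CDM, as in "Towards parameter-free data mining"). The use of zlib as the compressor, the 1-nearest-neighbour classifier, and all function names (csize, ncd, cdm, classify, classify_and_retain) are illustrative choices, not the authors' implementation.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes; zlib is an assumed stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity measure: C(xy) / (C(x) + C(y))."""
    return csize(x + y) / (csize(x) + csize(y))


def classify(case_base, message, distance=cdm):
    """Return the label of the nearest case (1-NN, an illustrative choice).

    case_base is a list of (raw_message_bytes, label) pairs; no features
    are extracted, the raw text is compared directly.
    """
    _, label = min(case_base, key=lambda case: distance(case[0], message))
    return label


def classify_and_retain(case_base, message, true_label, distance=cdm):
    """Classify a message, then retain it only if it was misclassified.

    This mirrors the simple retention policy described in the abstract;
    without a forgetting policy the case base can only grow.
    """
    predicted = classify(case_base, message, distance)
    if predicted != true_label:
        case_base.append((message, true_label))
    return predicted
```

Both measures are smaller when the two texts share material that the compressor can exploit. Keeping the case base at a constant size, as the abstract describes, would additionally need a forgetting policy, which this sketch leaves out.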