Textual case-based reasoning for spam filtering: a comparison of feature-based and feature-free approaches

Authors:
Sarah Jane Delany;Derek Bridge
Affiliations:
Dublin Institute of Technology, Dublin, Ireland;University College Cork, Cork, Ireland
Venue:
Artificial Intelligence Review
Year:
2006

Abstract

Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe the ECUE system, which classifies emails using a feature-based form of textual CBR. We then describe a feature-free alternative, which computes the distances between cases using a distance measure based on text compression. This distance measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison, which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust across compression algorithms: the accuracy when using a Lempel-Ziv compressor (GZip) is approximately the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to classify emails than the feature-based system. The classification time of both kinds of system can be improved by applying case base editing algorithms, which aim to remove noisy and redundant cases from a case base while maintaining, or even improving, generalisation accuracy. We report empirical results using the Competence-Based Editing (CBE) technique. We show that CBE removes more cases when we use the distance measure based on text compression (without significant changes in generalisation accuracy) than it does when we use the feature-based approach.
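
To make the feature-free idea concrete, the following is a minimal sketch of a compression-based distance combined with nearest-neighbour retrieval. It rests on several assumptions that the abstract does not confirm: it uses the normalized compression distance (NCD) as one well-known compression-based measure, it uses Python's `zlib` module (a Lempel-Ziv/DEFLATE compressor of the same family as GZip) in place of the compressors evaluated in the paper, and it classifies by a simple majority vote over the k nearest cases. The function names `compressed_size`, `ncd` and `classify` are hypothetical and are not taken from ECUE.

```python
import zlib
from collections import Counter


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Assumption: NCD is used here only as an illustrative
    compression-based distance; the paper's exact measure may differ.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, case_base: list[tuple[str, str]], k: int = 3) -> str:
    """Label a query by majority vote over its k nearest cases.

    case_base holds (text, label) pairs. No features are extracted;
    the raw text of each case is compared directly with the query.
    """
    neighbours = sorted(case_base, key=lambda case: ncd(query, case[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

Calling `classify(email_text, case_base, k=1)` returns the label of the single case whose text compresses best together with the query. Every call compresses the query against each case in the case base, which illustrates why classification is slow in this setting and why removing redundant cases from the case base can reduce classification time.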