This paper presents a content-based approach to spam detection that relies on low-level information. Instead of the traditional 'bag of words' representation, we use a 'bag of character n-grams' representation, which avoids the sparse data problem that arises with word-level n-grams. Moreover, it is language-independent and does not require a lemmatizer or any 'deep' text preprocessing. We evaluate the proposed representation in combination with support vector machines through experiments on the Ling-Spam corpus. Both the binary and the term-frequency representations achieve high precision while keeping recall at an equally high level, which is crucial for anti-spam filtering, a cost-sensitive application.
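To make the 'bag of character n-grams' idea concrete, the following is a minimal sketch of how a message could be turned into a binary or a term-frequency vector. It is not the paper's implementation: the function names, the choice of n = 3, the use of raw counts as term frequency, and the absence of any case folding or other preprocessing are all assumptions made for illustration. Training a support vector machine on the resulting vectors is not shown.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of `text` (n=3 is an assumed default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(documents, n=3):
    """Map each distinct character n-gram in the corpus to a column index."""
    grams = sorted({g for doc in documents for g in char_ngrams(doc, n)})
    return {g: i for i, g in enumerate(grams)}


def vectorize(text, vocabulary, n=3, binary=False):
    """Build a feature vector over `vocabulary`.

    binary=True  -> 1 if the n-gram occurs in the text, else 0
    binary=False -> raw occurrence count (an assumed definition of term frequency)
    """
    counts = Counter(g for g in char_ngrams(text, n) if g in vocabulary)
    vector = [0] * len(vocabulary)
    for gram, count in counts.items():
        vector[vocabulary[gram]] = 1 if binary else count
    return vector


# Hypothetical two-message corpus, used only to show the shapes involved.
messages = ["buy now", "buy buy"]
vocab = build_vocabulary(messages)
tf_vector = vectorize("buy buy", vocab)
binary_vector = vectorize("buy buy", vocab, binary=True)
```

Because the features are raw character sequences, no tokenizer, lemmatizer, or language-specific resource is involved, which is the property the abstract highlights. N-grams that were not seen when the vocabulary was built are simply ignored at vectorization time.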