Most previous email spam filtering techniques rely on traditional classification learning, which assumes that the training and test data are drawn from the same underlying distribution. In practice, however, this identical-distribution assumption is often violated. Email service providers generally collect training data from various publicly available resources, while the filtering task targets individual users' inboxes. Topics in the mailboxes vary among users, and the distributions shift as a result. In this paper, we propose an adaptive email spam filtering algorithm based on information theory that relaxes the identical-distribution assumption and adapts the knowledge learned from one distribution to another. Our work focuses on content analysis: the algorithm minimizes the loss in mutual information between email instances and word features before and after classification. We present theoretical and empirical analyses showing that our algorithm solves the adaptive email spam filtering problem well. The experimental results show that our algorithm greatly improves the accuracy of email filtering compared with traditional classification algorithms, while scaling very well.
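The objective named in the abstract, the loss in mutual information between email instances and word features before and after classification, can be illustrated with a small sketch. This is not the paper's algorithm. It only evaluates one plausible reading of the quantity, I(D;W) - I(C;W), for a toy email-by-word count table and a given label assignment. The function names, the count-table representation, and the two-class default are all assumptions made for illustration.

```python
import math


def mutual_information(table):
    """Mutual information (in bits) between rows and columns of a count table."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            if n > 0:  # zero cells contribute nothing
                mi += (n / total) * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi


def merge_rows(counts, labels, n_classes):
    """Collapse email rows into one row per class by summing word counts."""
    merged = [[0] * len(counts[0]) for _ in range(n_classes)]
    for row, label in zip(counts, labels):
        for j, n in enumerate(row):
            merged[label][j] += n
    return merged


def mutual_information_loss(counts, labels, n_classes=2):
    """I(emails; words) minus I(classes; words) for a label assignment.

    Hypothetical helper: an assumed reading of the "loss in mutual
    information before and after classification", not the paper's method.
    """
    before = mutual_information(counts)
    after = mutual_information(merge_rows(counts, labels, n_classes))
    return before - after


# Toy data (made up): 4 emails, 2 words; emails 0-1 use word 0, emails 2-3 use word 1.
counts = [[2, 0], [2, 0], [0, 2], [0, 2]]
print(mutual_information_loss(counts, [0, 0, 1, 1]))  # grouping matches word usage
print(mutual_information_loss(counts, [0, 1, 0, 1]))  # grouping mixes word usage
```

On this toy table, the labeling that groups emails by word usage loses no information, while the labeling that mixes them loses the full 1 bit. Under this reading, a classifier that minimizes the loss prefers the first assignment.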