A heuristic-based feature selection method for clustering spam emails

Authors:
Jungsuk Song;Masashi Eto;Hyung Chan Kim;Daisuke Inoue;Koji Nakao
Affiliations:
National Institute of Information and Communications Technology, Tokyo, Japan;National Institute of Information and Communications Technology, Tokyo, Japan;National Institute of Information and Communications Technology, Tokyo, Japan;National Institute of Information and Communications Technology, Tokyo, Japan;National Institute of Information and Communications Technology, Tokyo, Japan
Venue:
ICONIP'10 Proceedings of the 17th international conference on Neural information processing: theory and algorithms - Volume Part I
Year:
2010

Abstract

In recent years, many efforts have been made to cluster spam emails in order to cope with spam-based attacks. During clustering, many statistical features (e.g., the size of an email) are used to calculate similarities between spam emails. In many cases, however, some of these features are redundant or contribute little to the clustering process. Feature selection is a typical way to identify a subset of key features from an initial set. In this paper, we propose a heuristic-based feature selection method for clustering spam emails. Unlike existing methods, which form combinations of the given features and evaluate them using data mining and machine learning techniques, our method evaluates each feature solely according to its value distribution in spam clusters. With our method, we identified 4 significant features, which yielded a clustering accuracy of 86.33% with low time complexity.
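
The abstract does not spell out the heuristic, so the following is only a minimal sketch of the general idea of scoring each feature on its own from its value distribution inside clusters. The scoring rule (size-weighted within-cluster variance divided by overall variance) and the names `score_features` and `select_features` are assumptions made for illustration, not the authors' method.

```python
from collections import defaultdict
from statistics import pvariance


def score_features(rows, labels):
    """Score each feature independently (assumed heuristic, not the paper's).

    Score = size-weighted mean of within-cluster variance / overall variance.
    A low score means the feature's values are tightly grouped inside clusters.
    """
    clusters = defaultdict(list)
    for row, label in zip(rows, labels):
        clusters[label].append(row)

    scores = []
    for j in range(len(rows[0])):
        overall = pvariance([row[j] for row in rows])
        if overall == 0:
            # A constant feature cannot separate clusters; rank it last.
            scores.append(float("inf"))
            continue
        within = sum(
            len(members) * pvariance([m[j] for m in members])
            for members in clusters.values()
        ) / len(rows)
        scores.append(within / overall)
    return scores


def select_features(rows, labels, k):
    """Return the indices of the k best-scoring (lowest-score) features."""
    scores = score_features(rows, labels)
    return sorted(range(len(scores)), key=lambda j: scores[j])[:k]


# Hypothetical toy data: (email size, noisy value, constant value) per email.
rows = [(100.0, 5.0, 1.0), (102.0, 9.0, 1.0), (500.0, 6.0, 1.0), (498.0, 8.0, 1.0)]
labels = [0, 0, 1, 1]
selected = select_features(rows, labels, 2)
```

Because each feature is scored once rather than as part of every candidate subset, the cost grows linearly with the number of features, which is in line with the low time complexity the abstract claims for a per-feature evaluation.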