OPTICS: ordering points to identify the clustering structure
SIGMOD '99 Proceedings of the 1999 ACM SIGMOD international conference on Management of data
On coresets for k-means and k-median clustering
STOC '04 Proceedings of the thirty-sixth annual ACM symposium on Theory of computing
Density Connected Clustering with Local Subspace Preferences
ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems)
k-means++: the advantages of careful seeding
SODA '07 Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
Towards subspace clustering on dynamic data: an incremental version of PreDeCon
Proceedings of the First International Workshop on Novel Data Stream Pattern Mining Techniques
The Journal of Machine Learning Research
@spam: the underground on 140 characters or less
Proceedings of the 17th ACM conference on Computer and communications security
Detecting spam bots in online social networking sites: a machine learning approach
DBSec'10 Proceedings of the 24th annual IFIP WG 11.3 working conference on Data and applications security and privacy
Design and Evaluation of a Real-Time URL Spam Filtering Service
SP '11 Proceedings of the 2011 IEEE Symposium on Security and Privacy
Spam detection on twitter using traditional classifiers
ATC'11 Proceedings of the 8th international conference on Autonomic and trusted computing
Proceedings of the 21st international conference on World Wide Web
Least squares quantization in PCM
IEEE Transactions on Information Theory
Detecting social spam campaigns on twitter
ACNS'12 Proceedings of the 10th international conference on Applied Cryptography and Network Security
Hi-index | 0.07 |
The rapid growth of Twitter has triggered a dramatic increase in spam volume and sophistication. The abuse of certain Twitter components such as ''hashtags'', ''mentions'', and shortened URLs enables spammers to operate efficiently. These same features, however, may be a key factor in identifying new spam accounts as shown in previous studies. Our study provides three novel contributions. Firstly, previous studies have approached spam detection as a classification problem, whereas we view it as an anomaly detection problem. Secondly, 95 one-gram features from tweet text were introduced alongside the user information analyzed in previous studies. Finally, to effectively handle the streaming nature of tweets, two stream clustering algorithms, StreamKM++ and DenStream, were modified to facilitate spam identification. Both algorithms clustered normal Twitter users, treating outliers as spammers. Each of these algorithms performed well individually, with StreamKM++ achieving 99% recall and a 6.4% false positive rate; and DenStream producing 99% recall and a 2.8% false positive rate. When used in conjunction, these algorithms reached 100% recall and a 2.2% false positive rate, meaning that our system was able to identify 100% of the spammers in our test while incorrectly detecting only 2.2% of normal users as spammers.