Twitter spammer detection using data stream clustering

Authors:
Zachary Miller;Brian Dickinson;William Deitrick;Wei Hu;Alex Hai Wang
Affiliations:
Department of Computer Science, Houghton College, Houghton, NY, United States;Department of Computer Science, Houghton College, Houghton, NY, United States;Department of Computer Science, Houghton College, Houghton, NY, United States;Department of Computer Science, Houghton College, Houghton, NY, United States;College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA, United States
Venue:
Information Sciences: an International Journal
Year:
2014

Abstract

The rapid growth of Twitter has triggered a dramatic increase in spam volume and sophistication. The abuse of certain Twitter components such as "hashtags", "mentions", and shortened URLs enables spammers to operate efficiently. These same features, however, may be a key factor in identifying new spam accounts, as shown in previous studies. Our study provides three novel contributions. First, previous studies have approached spam detection as a classification problem, whereas we view it as an anomaly detection problem. Second, 95 one-gram features from tweet text were introduced alongside the user information analyzed in previous studies. Finally, to effectively handle the streaming nature of tweets, two stream clustering algorithms, StreamKM++ and DenStream, were modified to facilitate spam identification. Both algorithms clustered normal Twitter users, treating outliers as spammers. Each of these algorithms performed well individually, with StreamKM++ achieving 99% recall and a 6.4% false positive rate, and DenStream producing 99% recall and a 2.8% false positive rate. When used in conjunction, these algorithms reached 100% recall and a 2.2% false positive rate, meaning that our system identified 100% of the spammers in our test while incorrectly flagging only 2.2% of normal users as spammers.
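The idea of clustering normal users and treating outliers as spammers, together with the two reported metrics, can be illustrated with a minimal Python sketch. This is not StreamKM++ or DenStream, and it is not the authors' implementation: the fixed cluster centers, the distance threshold `radius`, the toy two-dimensional points, and the function names `flag_outliers` and `recall_and_fpr` are all assumptions made for illustration only.

```python
from math import dist  # Euclidean distance, Python 3.8+


def flag_outliers(points, centers, radius):
    """Flag a point as anomalous when it is farther than `radius` from every center.

    Hypothetical rule: `centers` stand in for clusters learned from normal users.
    """
    return [all(dist(p, c) > radius for c in centers) for p in points]


def recall_and_fpr(flagged, is_spammer):
    """Return (recall, false positive rate) for boolean predictions and labels."""
    pairs = list(zip(flagged, is_spammer))
    tp = sum(1 for f, s in pairs if f and s)          # spammers flagged
    fn = sum(1 for f, s in pairs if not f and s)      # spammers missed
    fp = sum(1 for f, s in pairs if f and not s)      # normal users flagged
    tn = sum(1 for f, s in pairs if not f and not s)  # normal users passed
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return recall, fpr


# Toy data (invented): two "normal" cluster centers and six accounts.
centers = [(0.0, 0.0), (5.0, 5.0)]
points = [(0.2, 0.1), (4.8, 5.3), (0.5, -0.4), (2.5, 2.5), (9.0, 0.0), (-4.0, 6.0)]
is_spammer = [False, False, False, False, True, True]

flagged = flag_outliers(points, centers, radius=1.5)
recall, fpr = recall_and_fpr(flagged, is_spammer)
print(flagged, recall, fpr)
```

In this toy run both spammers are flagged (recall 1.0) and one of the four normal accounts falls outside every cluster (false positive rate 0.25). Recall is the fraction of actual spammers flagged, and the false positive rate is the fraction of normal users flagged, which is how the percentages in the abstract are defined.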