Many social networks are predicated on the assumption that a member's online information reflects his or her real identity. In such networks, members who fill their name fields with fictitious identities, company names, phone numbers, or just gibberish are violating the terms of service, polluting search results, and degrading the value of the site to real members. Finding and removing these accounts on the basis of their spammy names can both improve the site experience for real members and prevent further abusive activity. In this paper we describe a set of features that can be used by a Naive Bayes classifier to find accounts whose names do not represent real people. The model can detect both automated and human abusers and can be used at registration time, before other signals such as social graph or clickstream history are present. We use member data from LinkedIn to train and validate our model and to choose parameters. Our best-scoring model achieves AUC 0.85 on a sequestered test set. We ran the algorithm on live LinkedIn data for one month in parallel with our previous name scoring algorithm based on regular expressions. The false positive rate of our new algorithm (3.3%) was less than half that of the previous algorithm (7.0%). When the algorithm is run on email usernames as well as user-entered first and last names, it provides an effective way to catch not only bad human actors but also bots that have poor name and email generation algorithms.
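The abstract does not spell out the feature representation, but a common way to score names with a Naive Bayes classifier is to use character n-grams of the name as features. The sketch below is a minimal, hypothetical illustration of that idea (multinomial Naive Bayes over character trigrams with Laplace smoothing); the class names, the trigram choice, and the toy training names are all assumptions, not the paper's actual features or data.

```python
# Hypothetical sketch: multinomial Naive Bayes over character trigrams
# for scoring how "spammy" a name looks. Not the paper's implementation.
import math
from collections import Counter

def char_ngrams(name, n=3):
    """Character n-grams with boundary markers, e.g. '^al', 'ali', ..., 'ce$'."""
    s = f"^{name.lower()}$"
    return [s[i:i + n] for i in range(len(s) - n + 1)]

class NaiveBayesNameScorer:
    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha  # Laplace smoothing constant
        self.counts = {"real": Counter(), "spam": Counter()}
        self.totals = {"real": 0, "spam": 0}
        self.docs = {"real": 0, "spam": 0}
        self.vocab = set()

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            grams = char_ngrams(name, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.docs[label] += 1
        self.vocab = set(self.counts["real"]) | set(self.counts["spam"])

    def _log_joint(self, name, label):
        # log P(label) + sum over n-grams of log P(gram | label)
        lp = math.log(self.docs[label] / (self.docs["real"] + self.docs["spam"]))
        denom = self.totals[label] + self.alpha * len(self.vocab)
        for g in char_ngrams(name, self.n):
            lp += math.log((self.counts[label][g] + self.alpha) / denom)
        return lp

    def spam_score(self, name):
        """Posterior probability that the name is spammy, in [0, 1]."""
        ls = self._log_joint(name, "spam")
        lr = self._log_joint(name, "real")
        m = max(ls, lr)  # subtract max for numerical stability
        return math.exp(ls - m) / (math.exp(ls - m) + math.exp(lr - m))

# Toy usage with made-up training names:
scorer = NaiveBayesNameScorer()
scorer.fit(
    ["alice johnson", "bob smith", "carol lee", "david kim",
     "xxqzwv kkklp", "free money llc", "qqqq zzzz", "asdf jkl"],
    ["real"] * 4 + ["spam"] * 4,
)
```

A name sharing trigrams with the gibberish examples (e.g. `scorer.spam_score("xxqz kkk")`) scores above 0.5, while a plausible name (e.g. `scorer.spam_score("alice smith")`) scores below it; in a production setting one would tune the decision threshold against a labeled validation set, as the paper does with its AUC evaluation.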