Many social networks are predicated on the assumption that a member's online information reflects his or her real identity. In such networks, members who fill their name fields with fictitious identities, company names, phone numbers, or just gibberish are violating the terms of service, polluting search results, and degrading the value of the site to real members. Finding and removing these accounts on the basis of their spammy names can both improve the site experience for real members and prevent further abusive activity. In this paper we describe a set of features that can be used by a Naive Bayes classifier to find accounts whose names do not represent real people. The model can detect both automated and human abusers and can be used at registration time, before other signals such as social graph or clickstream history are present. We use member data from LinkedIn to train and validate our model and to choose parameters. Our best-scoring model achieves AUC 0.85 on a sequestered test set. We ran the algorithm on live LinkedIn data for one month in parallel with our previous name scoring algorithm based on regular expressions. The false positive rate of our new algorithm (3.3%) was less than half that of the previous algorithm (7.0%). When the algorithm is run on email usernames as well as user-entered first and last names, it provides an effective way to catch not only bad human actors but also bots that have poor name and email generation algorithms.
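The abstract does not spell out the feature representation, but a common way to score names with a Naive Bayes classifier is to use character n-grams of the name as features. The sketch below is a minimal, hypothetical illustration of that idea (multinomial Naive Bayes over character trigrams with Laplace smoothing); the class names, the trigram choice, and the toy training names are all assumptions, not the paper's actual features or data.

```python
# Hypothetical sketch: multinomial Naive Bayes over character trigrams
# for scoring how "spammy" a name looks. Not the paper's implementation.
import math
from collections import Counter

def char_ngrams(name, n=3):
    """Character n-grams with boundary markers, e.g. '^al', 'ali', ..., 'ce$'."""
    s = f"^{name.lower()}$"
    return [s[i:i + n] for i in range(len(s) - n + 1)]

class NaiveBayesNameScorer:
    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha  # Laplace smoothing constant
        self.counts = {"real": Counter(), "spam": Counter()}
        self.totals = {"real": 0, "spam": 0}
        self.docs = {"real": 0, "spam": 0}
        self.vocab = set()

    def fit(self, names, labels):
        for name, label in zip(names, labels):
            grams = char_ngrams(name, self.n)
            self.counts[label].update(grams)
            self.totals[label] += len(grams)
            self.docs[label] += 1
        self.vocab = set(self.counts["real"]) | set(self.counts["spam"])

    def _log_joint(self, name, label):
        # log P(label) + sum over n-grams of log P(gram | label)
        lp = math.log(self.docs[label] / (self.docs["real"] + self.docs["spam"]))
        denom = self.totals[label] + self.alpha * len(self.vocab)
        for g in char_ngrams(name, self.n):
            lp += math.log((self.counts[label][g] + self.alpha) / denom)
        return lp

    def spam_score(self, name):
        """Posterior probability that the name is spammy, in [0, 1]."""
        ls = self._log_joint(name, "spam")
        lr = self._log_joint(name, "real")
        m = max(ls, lr)  # subtract max for numerical stability
        return math.exp(ls - m) / (math.exp(ls - m) + math.exp(lr - m))

# Toy usage with made-up training names:
scorer = NaiveBayesNameScorer()
scorer.fit(
    ["alice johnson", "bob smith", "carol lee", "david kim",
     "xxqzwv kkklp", "free money llc", "qqqq zzzz", "asdf jkl"],
    ["real"] * 4 + ["spam"] * 4,
)
```

A name sharing trigrams with the gibberish examples (e.g. `scorer.spam_score("xxqz kkk")`) scores above 0.5, while a plausible name (e.g. `scorer.spam_score("alice smith")`) scores below it; in a production setting one would tune the decision threshold against a labeled validation set, as the paper does with its AUC evaluation.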