Autonomous link spam detection in purely collaborative environments

Authors:
Andrew G. West; Avantika Agrawal; Phillip Baker; Brittney Exline; Insup Lee
Affiliations:
University of Pennsylvania, Philadelphia, PA (all authors)
Venue:
Proceedings of the 7th International Symposium on Wikis and Open Collaboration
Year:
2011

Abstract

Collaborative models (e.g., wikis) are an increasingly prevalent Web technology. However, the open access that defines such systems can also be exploited for nefarious purposes. In particular, this paper examines the use of collaborative functionality to add inappropriate hyperlinks to destinations outside the host environment (i.e., link spam). The collaborative encyclopedia Wikipedia is the basis for our analysis. Recent research has exposed vulnerabilities in Wikipedia's link spam mitigation, finding that human editors are slow to respond and dwindling in number. To this end, we propose and develop an autonomous classifier for link additions. Such a system presents unique challenges. For example, low barriers to entry invite a diversity of spam types, not just those with economic motivations. Moreover, issues can arise with how a link is presented, regardless of its destination. In this work, a spam corpus is extracted from over 235,000 link additions to English Wikipedia. From this corpus, 40+ features are codified and analyzed. These features are computed using wiki metadata, landing site analysis, and external data sources. The resulting classifier attains 64% recall at a 0.5% false-positive rate (ROC-AUC = 0.97). Such performance could enable egregious link additions to be blocked automatically with low false-positive rates, while prioritizing the remainder for human inspection. Finally, a live Wikipedia implementation of the technique has been developed.
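
To make the reported metrics concrete, the following is a minimal sketch of how "recall at a fixed false-positive rate" and ROC-AUC can be computed from classifier scores. It is not the authors' implementation: the function names (`roc_points`, `roc_auc`, `recall_at_fpr`) are hypothetical, and the scores and labels are invented toy values, not data from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (false-positive rate, recall)


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Point]:
    """Sweep the decision threshold from high to low and record ROC points.

    labels use 1 for spam and 0 for a legitimate link addition.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one spam and one non-spam example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Examples with equal scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points


def roc_auc(points: Sequence[Point]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def recall_at_fpr(points: Sequence[Point], max_fpr: float) -> float:
    """Best recall reachable without exceeding the false-positive budget."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)


if __name__ == "__main__":
    # Toy data (assumption): higher score means "more likely spam".
    toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
    toy_labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
    pts = roc_points(toy_scores, toy_labels)
    print("ROC-AUC:", roc_auc(pts))
    print("Recall at 0% false positives:", recall_at_fpr(pts, 0.0))
```

Read this way, a figure such as "64% recall at a 0.5% false-positive rate" describes a single operating point on the ROC curve, chosen so that automatic blocking rarely affects legitimate edits, while ROC-AUC summarizes ranking quality across all thresholds.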