Language of vandalism: improving Wikipedia vandalism detection via stylometric analysis

Authors:
Manoj Harpalani; Michael Hart; Sandesh Singh; Rob Johnson; Yejin Choi
Affiliations:
Stony Brook University, NY; Stony Brook University, NY; Stony Brook University, NY; Stony Brook University, NY; Stony Brook University, NY
Venue:
HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Year:
2011




Abstract

Community-based knowledge forums, such as Wikipedia, are susceptible to vandalism, i.e., ill-intentioned contributions that are detrimental to the quality of collective intelligence. Most previous work relies on shallow lexico-syntactic patterns and metadata to automatically detect vandalism in Wikipedia. In this paper, we explore more linguistically motivated approaches to vandalism detection. In particular, we hypothesize that textual vandalism constitutes a unique genre in which a group of people share similar linguistic behavior. Experimental results suggest that (1) statistical models provide evidence of unique language styles in vandalism, and that (2) deep syntactic patterns based on probabilistic context-free grammars (PCFGs) discriminate vandalism more effectively than shallow lexico-syntactic patterns based on n-grams.
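
To make the "unique genre" hypothesis concrete, here is a minimal sketch of one common way to test whether two classes of text have distinct styles: train one language model per class and compare the likelihood each assigns to a new edit. This is an illustration only, not the authors' implementation. The class `BigramLM`, the function `classify`, the add-one smoothing, and the toy sentences are all assumptions introduced here. A PCFG-based variant would replace the bigram log-probability with the likelihood a class-specific parser assigns to the text.

```python
import math
from collections import Counter


def bigrams(text):
    """Lowercase, whitespace-tokenize, and return padded bigrams."""
    tokens = ["<s>"] + text.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))


class BigramLM:
    """Hypothetical add-one-smoothed bigram model for one class of edits."""

    def __init__(self, texts):
        self.bigram_counts = Counter()
        self.context_counts = Counter()
        self.vocab = set()
        for text in texts:
            for prev, cur in bigrams(text):
                self.bigram_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1
                self.vocab.update((prev, cur))

    def log_prob(self, text, vocab_size):
        """Sum of smoothed log P(cur | prev) over the text's bigrams."""
        total = 0.0
        for prev, cur in bigrams(text):
            numerator = self.bigram_counts[(prev, cur)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(text, vandal_lm, regular_lm):
    """Label text by whichever class model assigns the higher likelihood."""
    # Shared vocabulary size, plus one slot for unseen words.
    vocab_size = len(vandal_lm.vocab | regular_lm.vocab) + 1
    diff = (vandal_lm.log_prob(text, vocab_size)
            - regular_lm.log_prob(text, vocab_size))
    return "vandalism" if diff > 0 else "regular"


# Toy training data, invented purely for illustration.
vandal_lm = BigramLM(["you are so dumb lol",
                      "this page is so dumb",
                      "lol lol lol"])
regular_lm = BigramLM(["the city was founded in 1850",
                       "the river flows through the city",
                       "the population was recorded in the census"])

print(classify("this is so dumb lol", vandal_lm, regular_lm))
print(classify("the city was recorded in the census", vandal_lm, regular_lm))
```

In practice the log-likelihood difference would more plausibly be fed to a classifier as one feature among many rather than thresholded at zero. The zero threshold is used here only to keep the sketch short.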