"Got you!": automatic vandalism detection in Wikipedia with web-based shallow syntactic-semantic modeling

Authors: William Yang Wang; Kathleen R. McKeown
Affiliations: Columbia University; Columbia University
Venue: COLING '10 Proceedings of the 23rd International Conference on Computational Linguistics
Year: 2010

Abstract

Discriminating vandalism edits from non-vandalism edits in Wikipedia is a challenging task, because ill-intentioned edits can include a variety of content and can be expressed in many different forms and styles. Previous studies are limited to rule-based methods and to learning from lexical features, and they lack linguistic analysis. In this paper, we propose a novel Web-based shallow syntactic-semantic modeling method, which uses Web search results as a resource and trains topic-specific n-tag and syntactic n-gram language models to detect vandalism. Combining these models with basic task-specific and lexical features, we achieve high F-measures using logistic boosting and logistic model trees classifiers, surpassing the results reported by major Wikipedia vandalism detection systems.
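
As a rough illustration of the n-tag language model idea mentioned in the abstract, the sketch below trains a bigram model over part-of-speech tag sequences and scores a new edit by its average log-probability. It is not the authors' implementation. Reading "n-tag" as an n-gram over POS tags is an interpretation, the tiny hand-tagged corpus stands in for topic-specific Web search results, and the function names `train_ntag_model` and `avg_log_prob` as well as the add-one smoothing are assumptions made for this example.

```python
import math
from collections import Counter


def train_ntag_model(tag_sequences, n=2):
    """Count tag n-grams and their (n-1)-tag contexts (hypothetical helper)."""
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for tags in tag_sequences:
        padded = ["<s>"] * (n - 1) + list(tags) + ["</s>"]
        vocab.update(padded)
        for i in range(n - 1, len(padded)):
            context = tuple(padded[i - n + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    # Reserve one extra slot so unseen tags still get non-zero probability.
    return {"n": n, "ngrams": ngrams, "contexts": contexts,
            "vocab_size": len(vocab) + 1}


def avg_log_prob(model, tags):
    """Average add-one-smoothed log-probability per predicted tag."""
    n, v = model["n"], model["vocab_size"]
    padded = ["<s>"] * (n - 1) + list(tags) + ["</s>"]
    total, count = 0.0, 0
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        numerator = model["ngrams"][context + (padded[i],)] + 1
        denominator = model["contexts"][context] + v
        total += math.log(numerator / denominator)
        count += 1
    return total / count


# Assumed stand-in for tagged, topic-specific Web text (Penn Treebank tags).
topic_corpus = [
    ["DT", "NN", "VBZ", "DT", "JJ", "NN"],
    ["NNP", "VBD", "VBN", "IN", "CD"],
    ["DT", "JJ", "NN", "VBZ", "VBN", "IN", "NNP"],
]
model = train_ntag_model(topic_corpus, n=2)

regular_edit = ["DT", "NN", "VBZ", "DT", "NN"]
unusual_edit = ["UH", "UH", "PRP", "VBP", "JJ", "UH"]

regular_score = avg_log_prob(model, regular_edit)
unusual_score = avg_log_prob(model, unusual_edit)
print(f"regular edit: {regular_score:.3f}")
print(f"unusual edit: {unusual_score:.3f}")
```

In a setup like this, the score would be one feature among others passed to a classifier; an edit whose tag sequence looks unlike the topic-specific text receives a lower average log-probability.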