Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso

Authors:
Sara Javanmardi; David W. McDonald; Cristina V. Lopes
Affiliations:
University of California, Irvine; University of Washington; University of California, Irvine
Venue:
Proceedings of the 7th International Symposium on Wikis and Open Collaboration
Year:
2011


Abstract

User-generated content (UGC) constitutes a significant fraction of the Web. However, some wiki-based sites, such as Wikipedia, are so popular that they have become a favorite target of spammers and other vandals. On such popular sites, human vigilance is not enough to combat vandalism, and tools that detect possible vandalism and poor-quality contributions become a necessity. Machine learning techniques hold promise for developing efficient online algorithms that power better tools to assist users in vandalism detection. We describe an efficient and accurate classifier that performs vandalism detection on UGC sites, and we report its results on the PAN Wikipedia dataset. We explore the effectiveness of a combination of 66 individual features that produces an AUC of 0.9553 on a test dataset, the best result to our knowledge. Using Lasso optimization, we then reduce our feature-rich model to a much smaller and more efficient model of 28 features that performs almost as well: the drop in AUC is only 0.005. We describe how this approach can be generalized to other user-generated content systems and present several applications of this classifier to help users identify potential vandalism.
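The two technical ideas in the abstract, scoring a classifier by AUC and shrinking a feature set with a Lasso (L1) penalty, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the classifier or the exact Lasso setup, so an L1-penalised logistic regression fitted by proximal gradient descent stands in here, and the data are synthetic rather than the PAN Wikipedia dataset. All names (`auc`, `soft_threshold`, `fit_l1_logistic`, `make_data`, `demo`) and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def fit_l1_logistic(X, y, lam, lr=0.5, epochs=200):
    """L1-penalised logistic regression fitted by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        b -= lr * grad_b  # the intercept is not penalised
        # Gradient step followed by soft-thresholding drives weak weights to exactly 0.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def make_data(rng, n, d=10):
    """Synthetic 'edits' (an assumption): only the first three features carry signal."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        signal = 2.0 * x[0] - 1.5 * x[1] + 1.0 * x[2] + rng.gauss(0.0, 0.5)
        X.append(x)
        y.append(1 if signal > 0 else 0)
    return X, y


def demo(seed=0, lam=0.1):
    """Fit on synthetic training data, keep non-zero features, score held-out AUC."""
    rng = random.Random(seed)
    X_train, y_train = make_data(rng, 400)
    X_test, y_test = make_data(rng, 200)
    w, b = fit_l1_logistic(X_train, y_train, lam)
    selected = [j for j, wj in enumerate(w) if wj != 0.0]
    scores = [b + sum(w[j] * x[j] for j in selected) for x in X_test]
    return selected, auc(scores, y_test)


selected, test_auc = demo()
print("features kept:", selected)
print("held-out AUC with the reduced model:", round(test_auc, 4))
```

Raising `lam` in this sketch removes more features at some cost in AUC, which is the same trade-off the abstract reports when going from 66 features to 28.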