Detecting Wikipedia vandalism with active learning and statistical language models

Authors:
Si-Chi Chin;W. Nick Street;Padmini Srinivasan;David Eichmann
Affiliations:
The University of Iowa, Iowa City, IA, USA;The University of Iowa, Iowa City, IA, USA;The University of Iowa, Iowa City, IA, USA;The University of Iowa, Iowa City, IA, USA
Venue:
Proceedings of the 4th workshop on Information credibility
Year:
2010

Citing 11
Cited 12

Employing EM and Pool-Based Active Learning for Text Classification

ICML '98 Proceedings of the Fifteenth International Conference on Machine Learning
Studying cooperation and conflict between authors with history flow visualizations

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
He says, she says: conflict and coordination in Wikipedia

Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
Measuring Qualities of Articles Contributed by Online Communities

WI '06 Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence
Know your neighbors: web spam detection using the web topology

SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
Creating, destroying, and restoring value in wikipedia

Proceedings of the 2007 international ACM conference on Supporting group work
Measuring article quality in wikipedia: models and evaluation

Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
On ranking controversies in wikipedia: models and evaluation

WSDM '08 Proceedings of the 2008 International Conference on Web Search and Data Mining
Can you ever trust a wiki?: impacting perceived trustworthiness in wikipedia

Proceedings of the 2008 ACM conference on Computer supported cooperative work
Modeling trust in collaborative information systems

COLCOM '07 Proceedings of the 2007 International Conference on Collaborative Computing: Networking, Applications and Worksharing
Automatic vandalism detection in Wikipedia

ECIR'08 Proceedings of the IR research, 30th European conference on Advances in information retrieval

Elusive vandalism detection in wikipedia: a text stability-based approach

CIKM '10 Proceedings of the 19th ACM international conference on Information and knowledge management
Wikipedia vandalism detection

Proceedings of the 20th international conference companion on World wide web
Wikipedia vandalism detection: combining natural language, metadata, and reputation features

CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
Wikipedia revision toolkit: efficiently accessing Wikipedia's edit history

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Systems Demonstrations
Language of vandalism: improving Wikipedia vandalism detection via stylometric analysis

HLT '11 Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2
Vandalism detection in Wikipedia: a high-performing, feature-rich model and its reduction through Lasso

Proceedings of the 7th International Symposium on Wikis and Open Collaboration
Automatic Assessment of Document Quality in Web Collaborative Digital Libraries

Journal of Data and Information Quality (JDIQ)
How the web can help Wikipedia: a study on information complementation of Wikipedia by the web

Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication
Trust in collaborative web applications

Future Generation Computer Systems
Common Sense Reasoning for Detection, Prevention, and Mitigation of Cyberbullying

ACM Transactions on Interactive Intelligent Systems (TiiS) - Special Issue on Common Sense for Interactive Systems
Detecting wikipedia vandalism with a contributing efficiency-based approach

WISE'12 Proceedings of the 13th international conference on Web Information Systems Engineering
WHAD: Wikipedia historical attributes data

Language Resources and Evaluation

Abstract

This paper proposes an active learning approach that uses language model statistics to detect Wikipedia vandalism. Wikipedia is a popular and influential collaborative information system. The collaborative nature of authoring, together with the high visibility of its content, has exposed Wikipedia articles to vandalism. Vandalism is defined as malicious editing intended to compromise the integrity of the content of articles. Extensive manual efforts are being made to combat vandalism, and an automated approach is needed to alleviate this laborious process. This paper builds statistical language models by constructing distributions of words from the revision history of Wikipedia articles. Because vandalism often involves unexpected words chosen to draw attention, how well (or how poorly) a new edit fits the language models built from previous versions may indicate whether the edit is vandalism. In addition, the paper adopts an active learning model to address the noisy and incomplete labeling of Wikipedia vandalism. The Wikipedia domain, with its revision histories, offers a novel context in which to explore the potential of language models for characterizing author intention. As the experimental results presented in the paper demonstrate, these models hold promise for vandalism detection.
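
The idea of scoring a new edit against a language model built from earlier revisions can be illustrated with a minimal sketch. The abstract does not specify the model order, smoothing method, or tokenization the authors used, so everything below is an assumption made for illustration: a unigram model, add-one smoothing with a single slot for unseen words, a whitespace tokenizer, and the hypothetical function names `build_unigram_model` and `avg_neg_log_likelihood`. It is not the paper's implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


def build_unigram_model(revisions):
    """Count word occurrences across earlier revisions of one article."""
    counts = Counter()
    for revision in revisions:
        counts.update(tokenize(revision))
    return counts


def avg_neg_log_likelihood(counts, new_text):
    """Average per-word surprise of new_text under the smoothed unigram model.

    Higher values mean the text fits the article's history less well.
    Add-one smoothing is an assumption; the source names no smoothing method.
    """
    words = tokenize(new_text)
    if not words:
        return 0.0
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen words
    score = 0.0
    for word in words:
        prob = (counts[word] + 1) / (total + vocab_size)
        score -= math.log(prob)
    return score / len(words)


# Toy revision history (invented text, not data from the paper).
history = [
    "iowa city is a city in johnson county iowa",
    "iowa city is the county seat of johnson county iowa",
]
model = build_unigram_model(history)
plausible = avg_neg_log_likelihood(model, "iowa city is in johnson county")
suspicious = avg_neg_log_likelihood(model, "lol zzz haha")
print(plausible < suspicious)
```

In this toy setting the edit made of words never seen in the history receives the higher score. A score like this would be only one input feature; how the paper combines such statistics with active learning is not described in the abstract.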