Revisiting reverts: accurate revert detection in Wikipedia

  • Authors:
  • Fabian Flöck, Denny Vrandečić, Elena Simperl

  • Affiliations:
  • Karlsruhe Institute of Technology, Karlsruhe, Germany (all authors)

  • Venue:
  • Proceedings of the 23rd ACM conference on Hypertext and social media
  • Year:
  • 2012


Abstract

Wikipedia is commonly used as a proving ground for research in collaborative systems. This is likely due to its popularity and scale, but also to the fact that large amounts of data about its formation and evolution are freely available to inform and validate theories and models of online collaboration. As part of the development of such approaches, revert detection is often performed as an important pre-processing step in tasks as diverse as the extraction of implicit networks of editors, the analysis of edit or editor features, and the removal of noise when analyzing the emergence of the content of an article. The current state of the art in revert detection is based on a rather naive approach, which identifies revision duplicates based on MD5 hash values. This is an efficient, but not very precise technique that forms the basis for the majority of research based on revert relations in Wikipedia. In this paper we prove that this method has a number of important drawbacks: it only detects a limited number of reverts, while simultaneously misclassifying too many edits as reverts, and it does not distinguish between complete and partial reverts. This is very likely to hamper the accurate interpretation of the findings of revert-related research. We introduce an improved algorithm for the detection of reverts, based on word tokens added or deleted, that addresses these drawbacks. We report on the results of a user study and other tests demonstrating the considerable gains in accuracy and coverage achieved by our method, and argue for a positive trade-off, in certain research scenarios, between these improvements and our algorithm's increased runtime.
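
To make the two approaches concrete, the sketch below contrasts the MD5-based identity check described in the abstract with a simplified word-token comparison. This is an illustrative sketch only, not the authors' implementation: the function names, the whitespace tokenization, and the overlap criteria used to label full versus partial reverts are assumptions made for the example.

    import hashlib
    from collections import Counter

    def md5_identity_reverts(revision_texts):
        """Naive baseline: a revision is flagged as a revert if its MD5 hash
        matches the hash of any earlier revision of the same article.
        Partial reverts are invisible to this check, and any edit that happens
        to restore an earlier state is counted as a revert."""
        seen = {}      # MD5 digest -> index of first revision with that text
        reverts = []   # (reverting revision index, reverted-to revision index)
        for i, text in enumerate(revision_texts):
            digest = hashlib.md5(text.encode("utf-8")).hexdigest()
            if digest in seen:
                reverts.append((i, seen[digest]))
            else:
                seen[digest] = i
        return reverts

    def token_based_reverts(revision_texts):
        """Token-level sketch: edit j is flagged as undoing edit i when j removes
        word tokens that i added or re-adds tokens that i removed; it is labeled
        a full revert when both deltas mirror each other exactly, else partial."""
        deltas = []    # per edit: (tokens added, tokens removed) as Counters
        for k in range(1, len(revision_texts)):
            before = Counter(revision_texts[k - 1].split())
            after = Counter(revision_texts[k].split())
            deltas.append((after - before, before - after))
        reverts = []   # (reverting edit, reverted edit, "full" or "partial")
        for j, (added_j, removed_j) in enumerate(deltas):
            for i in range(j):
                added_i, removed_i = deltas[i]
                if (removed_j & added_i) or (added_j & removed_i):
                    full = removed_j == added_i and added_j == removed_i
                    reverts.append((j + 1, i + 1, "full" if full else "partial"))
        return reverts

For a toy history such as ["a b c", "a b c d", "a b c"], the hash check pairs revision 2 with revision 0 as an exact duplicate, while the token-based sketch reports revision 2 as a full revert of the edit that produced revision 1; on partial undos, only the token-based variant reports anything at all.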