Attributing authorship of revisioned content
Proceedings of the 22nd international conference on World Wide Web
Revisioned text content is present in numerous collaboration platforms on the Web, most notably Wikis. Tracking the authorship of text tokens in such systems has many potential applications, such as identifying main authors for licensing purposes or tracing collaborative writing patterns over time. In this context, two main challenges arise. First, an authorship tracking system must be precise in its attributions so that its output is reliable for further processing. Second, it has to run efficiently even on very large datasets, such as Wikipedia. As a solution, we propose a graph-based model for representing revisioned content and an algorithm over this model that tackles both issues effectively. We describe the optimal implementation and design choices when tuning it to a Wiki environment. We further present a gold standard of 240 tokens from English Wikipedia articles annotated with their origin. This gold standard was created manually and confirmed by multiple independent users of a crowdsourcing platform. It is the first gold standard of this kind and quality, and our solution achieves an average of 95% precision on this dataset. We also perform a first-ever precision evaluation of the state-of-the-art algorithm for this task, exceeding it by over 10% on average. Our approach outperforms the execution time of the state of the art by one order of magnitude, as we demonstrate on a sample of over 240 English Wikipedia articles. We argue that the roughly 10% increase in size of an optional materialization of our results compared to the baseline is a favorable trade-off, given the large advantage in runtime performance.
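To illustrate the general task the abstract describes, the following is a minimal sketch of token-level authorship attribution across revisions, using greedy diff matching: each token is credited to the author of the revision in which it first appeared. This is an assumed simplification for illustration only, not the paper's algorithm; the graph-based model proposed in the paper additionally handles deleted and later reintroduced content, which this sketch does not.

```python
from difflib import SequenceMatcher

def attribute_authorship(revisions):
    """Attribute each token of the latest revision to the author of the
    revision in which it first appeared.

    `revisions` is a chronological list of (author, text) pairs.
    Returns a list of (token, author) pairs for the final revision.
    """
    attributed = []  # (token, author) pairs for the current revision
    for author, text in revisions:
        tokens = text.split()
        prev_tokens = [tok for tok, _ in attributed]
        matcher = SequenceMatcher(a=prev_tokens, b=tokens, autojunk=False)
        new_attr = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                # Tokens carried over from the previous revision keep
                # their original author.
                new_attr.extend(attributed[i1:i2])
            else:
                # Inserted or replacing tokens are credited to the
                # author of the current revision.
                new_attr.extend((tok, author) for tok in tokens[j1:j2])
        attributed = new_attr
    return attributed

revs = [
    ("alice", "the quick brown fox"),
    ("bob",   "the quick red fox jumps"),
]
print(attribute_authorship(revs))
# → [('the', 'alice'), ('quick', 'alice'), ('red', 'bob'),
#    ('fox', 'alice'), ('jumps', 'bob')]
```

A greedy diff like this is sensitive to token reordering and revert wars, which is precisely the precision problem the paper's graph-based model and its evaluation against the gold standard address.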