Attributing authorship of revisioned content

Authors:
Luca de Alfaro;Michael Shavlovsky
Affiliations:
University of California, Santa Cruz, CA, USA;University of California, Santa Cruz, CA, USA
Venue:
Proceedings of the 22nd international conference on World Wide Web
Year:
2013

Citing 12
Cited 1

Algorithms on strings, trees, and sequences: computer science and computational biology

Algorithms on strings, trees, and sequences: computer science and computational biology
A Space-Economical Suffix Tree Construction Algorithm

Journal of the ACM (JACM)
The string-to-string correction problem with block moves

ACM Transactions on Computer Systems (TOCS)
Why and Where: A Characterization of Data Provenance

ICDT '01 Proceedings of the 8th International Conference on Database Theory
Data Provenance: Some Basic Issues

FST TCS 2000 Proceedings of the 20th Conference on Foundations of Software Technology and Theoretical Computer Science
A survey of data provenance in e-science

ACM SIGMOD Record
A content-driven reputation system for the wikipedia

Proceedings of the 16th international conference on World Wide Web
What motivates Wikipedians?

Communications of the ACM
The provenance of electronic data

Communications of the ACM - The psychology of security: why do good users make bad decisions?
Provenance for Computational Tasks: A Survey

Computing in Science and Engineering
Linear pattern matching algorithms

SWAT '73 Proceedings of the 14th Annual Symposium on Switching and Automata Theory (swat 1973)
Measuring author contributions to the Wikipedia

WikiSym '08 Proceedings of the 4th International Symposium on Wikis

WikiWho: precise and efficient attribution of authorship of revisioned content

Proceedings of the 23rd international conference on World wide web

Quantified Score

Hi-index	0.00

Visualization

Abstract

A considerable portion of web content, from wikis to collaboratively edited documents, to code posted online, is revisioned. We consider the problem of attributing authorship to such revisioned content, and we develop scalable attribution algorithms that can be applied to very large bodies of revisioned content, such as the English Wikipedia. Since content can be deleted, only to be later re-inserted, we introduce a notion of authorship that requires comparing each new revision with the entire set of past revisions. For each portion of content in the newest revision, we search the entire history for content matches that are statistically unlikely to occur spontaneously, thus denoting common origin. We use these matches to compute the earliest possible attribution of each word (or each token) of the new content. We show that this "earliest plausible attribution" can be computed efficiently via compact summaries of the past revision history. This leads to an algorithm that runs in time proportional to the sum of the size of the most recent revision, and the total amount of change (edit work) in the revision history. This amount of change is typically much smaller than the total size of all past revisions. The resulting algorithm can scale to very large repositories of revisioned content, as we show via experimental data over the English Wikipedia.