WikiWho: precise and efficient attribution of authorship of revisioned content
Proceedings of the 23rd International Conference on World Wide Web
A considerable portion of web content, from wikis to collaboratively edited documents, to code posted online, is revisioned. We consider the problem of attributing authorship to such revisioned content, and we develop scalable attribution algorithms that can be applied to very large bodies of revisioned content, such as the English Wikipedia. Since content can be deleted, only to be later re-inserted, we introduce a notion of authorship that requires comparing each new revision with the entire set of past revisions. For each portion of content in the newest revision, we search the entire history for content matches that are statistically unlikely to occur spontaneously, thus denoting common origin. We use these matches to compute the earliest possible attribution of each word (or each token) of the new content. We show that this "earliest plausible attribution" can be computed efficiently via compact summaries of the past revision history. This leads to an algorithm that runs in time proportional to the sum of the size of the most recent revision, and the total amount of change (edit work) in the revision history. This amount of change is typically much smaller than the total size of all past revisions. The resulting algorithm can scale to very large repositories of revisioned content, as we show via experimental data over the English Wikipedia.
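To make the attribution idea concrete, here is a minimal Python sketch of "earliest plausible attribution". It is a simplified illustration, not the paper's implementation: it assumes whitespace-separated word tokens, treats any run of at least MIN_MATCH consecutive tokens that already appeared as evidence of common origin (in place of the statistical rarity test), and uses a plain dictionary of token n-grams as the compact summary of past revisions. The names attribute_revision, summary, and MIN_MATCH are illustrative only.

```python
# Minimal sketch of "earliest plausible attribution" for revisioned text.
# Simplifying assumptions (not the paper's actual implementation):
#   * tokens are whitespace-separated words;
#   * any run of MIN_MATCH consecutive tokens that already occurred is
#     treated as evidence of common origin, replacing a statistical
#     rarity test;
#   * the "compact summary" of past revisions is a dict mapping each
#     token n-gram to the earliest revision index that contained it.

MIN_MATCH = 3  # minimum run of tokens taken to indicate common origin


def attribute_revision(new_tokens, summary, rev_index):
    """Return, for each token of the new revision, the earliest revision
    index to which it can plausibly be attributed."""
    origins = [rev_index] * len(new_tokens)
    for start in range(len(new_tokens) - MIN_MATCH + 1):
        gram = tuple(new_tokens[start:start + MIN_MATCH])
        earliest = summary.get(gram)
        if earliest is not None:
            # Every token covered by a matching n-gram is dated back to
            # the earliest revision in which that n-gram appeared.
            for i in range(start, start + MIN_MATCH):
                origins[i] = min(origins[i], earliest)
    # Fold this revision into the summary, keeping only the earliest
    # revision index per n-gram.
    for start in range(len(new_tokens) - MIN_MATCH + 1):
        gram = tuple(new_tokens[start:start + MIN_MATCH])
        summary.setdefault(gram, rev_index)
    return origins


if __name__ == "__main__":
    summary = {}
    revisions = [
        "the quick brown fox jumps over the lazy dog",
        "a quick brown fox jumps over the sleeping dog",  # partial reuse
        "the quick brown fox jumps over the lazy dog once more",
    ]
    for r, text in enumerate(revisions):
        tokens = text.split()
        print(r, list(zip(tokens, attribute_revision(tokens, summary, r))))
```

Note that this sketch re-scans every revision in full, so it does not reproduce the running time claimed above, which depends on genuinely compact summaries of the revision history; it only shows how per-token origins are pushed back to the earliest matching revision.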