We present an open-source toolkit that allows users (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of data involved. By using a dedicated storage format, our toolkit reduces the data volume to less than 2% of the original size, while at the same time providing an easy-to-use interface for accessing the revision data. The language-independent design allows processing of any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia's edit history.
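The abstract attributes the storage savings to a dedicated storage format for revision data. A common way to achieve such compression is to store only the first revision of an article in full and represent each later revision as a delta against its predecessor. The sketch below illustrates this general idea only; it is not the toolkit's actual API or storage format, and the function names are hypothetical. It uses Python's standard `difflib` deltas for brevity.

```python
import difflib

def encode_history(revisions):
    """Hypothetical sketch: keep the first revision in full and encode each
    subsequent revision as an ndiff delta against its predecessor."""
    deltas = []
    for prev, cur in zip(revisions, revisions[1:]):
        deltas.append(list(difflib.ndiff(prev.splitlines(keepends=True),
                                         cur.splitlines(keepends=True))))
    return (revisions[0] if revisions else ""), deltas

def reconstruct(first, deltas, index):
    """Recover revision `index` by replaying deltas from the first revision.
    difflib.restore(delta, 2) yields the 'after' side of each delta."""
    text = first
    for delta in deltas[:index]:
        text = "".join(difflib.restore(delta, 2))
    return text

# Toy article history: any past state can be rebuilt from the deltas.
revs = ["Intro.\nOld claim.\n",
        "Intro.\nCorrected claim.\n",
        "Intro.\nCorrected claim.\nNew section.\n"]
first, deltas = encode_history(revs)
```

In practice, deltas between consecutive Wikipedia revisions are typically small relative to full article text, which is what makes this style of storage so much more compact than keeping every revision verbatim; a real implementation would additionally compress the stored deltas.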