We present an open-source toolkit that allows users (i) to reconstruct past states of Wikipedia, and (ii) to efficiently access the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but access is severely impeded by the lack of efficient tools for managing the huge amount of data involved. By using a dedicated storage format, our toolkit reduces the data volume to less than 2% of the original size, while at the same time providing an easy-to-use interface for accessing the revision data. The language-independent design allows processing of any language represented in Wikipedia. We expect this work to consolidate NLP research using Wikipedia in general, and to foster research making use of the knowledge encoded in Wikipedia's edit history.
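The abstract attributes the storage savings to a dedicated storage format for revision data. A common way to achieve such compression is to store only the first revision of an article in full and represent each later revision as a delta against its predecessor. The sketch below illustrates this general idea only; it is not the toolkit's actual API or storage format, and the function names are hypothetical. It uses Python's standard `difflib` deltas for brevity.

```python
import difflib

def encode_history(revisions):
    """Hypothetical sketch: keep the first revision in full and encode each
    subsequent revision as an ndiff delta against its predecessor."""
    deltas = []
    for prev, cur in zip(revisions, revisions[1:]):
        deltas.append(list(difflib.ndiff(prev.splitlines(keepends=True),
                                         cur.splitlines(keepends=True))))
    return (revisions[0] if revisions else ""), deltas

def reconstruct(first, deltas, index):
    """Recover revision `index` by replaying deltas from the first revision.
    difflib.restore(delta, 2) yields the 'after' side of each delta."""
    text = first
    for delta in deltas[:index]:
        text = "".join(difflib.restore(delta, 2))
    return text

# Toy article history: any past state can be rebuilt from the deltas.
revs = ["Intro.\nOld claim.\n",
        "Intro.\nCorrected claim.\n",
        "Intro.\nCorrected claim.\nNew section.\n"]
first, deltas = encode_history(revs)
```

In practice, deltas between consecutive Wikipedia revisions are typically small relative to full article text, which is what makes this style of storage so much more compact than keeping every revision verbatim; a real implementation would additionally compress the stored deltas.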