Language resources extracted from Wikipedia

  • Authors:
  • Denny Vrandecić; Philipp Sorg; Rudi Studer

  • Affiliations:
  • KIT and Wikimedia Deutschland, Karlsruhe and Berlin, Germany; KIT, Karlsruhe, Germany; KIT, Karlsruhe, Germany

  • Venue:
  • Proceedings of the Sixth International Conference on Knowledge Capture
  • Year:
  • 2011

Abstract

Wikipedia provides a substantial amount of text in more than a hundred languages. This includes languages for which no reference corpora or other linguistic resources are easily available. We have extracted background language models from the content of Wikipedia in various languages. The models generated from the Simple English and English Wikipedias are compared to language models derived from other established corpora. The differences between the models with regard to term coverage, term distribution, and correlation are described and discussed. We provide access to the full dataset and offer visualizations of the language models that can be used for exploratory analysis. The paper describes the newly released dataset, covering 33 languages, and the services that we provide on top of it.
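
As a rough illustration of the kind of comparison the abstract describes, the following Python sketch builds simple unigram background models from raw text and compares two of them by term coverage and by correlation of term frequencies over their shared vocabulary. The tokenization, the metrics, and the sample texts are assumptions made for illustration only; they do not reproduce the released dataset or the authors' actual extraction pipeline.

```python
import math
from collections import Counter

def build_unigram_model(text):
    """Build a unigram background model: term -> relative frequency.
    Hypothetical sketch; real pipelines would use proper tokenization."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def term_coverage(model_a, model_b):
    """Fraction of terms in model_a that also occur in model_b."""
    if not model_a:
        return 0.0
    shared = sum(1 for term in model_a if term in model_b)
    return shared / len(model_a)

def frequency_correlation(model_a, model_b):
    """Pearson correlation of term frequencies over the shared vocabulary."""
    shared = sorted(set(model_a) & set(model_b))
    if len(shared) < 2:
        return 0.0
    xs = [model_a[t] for t in shared]
    ys = [model_b[t] for t in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Toy usage with made-up snippets standing in for two corpora.
simple = build_unigram_model("the cat sat on the mat the cat is small")
english = build_unigram_model("the feline sat on the mat and the cat is rather small")
print("term coverage:", term_coverage(simple, english))
print("frequency correlation:", frequency_correlation(simple, english))
```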