Wikipedia provides a substantial amount of text for more than a hundred languages, including languages for which no reference corpora or other linguistic resources are readily available. We have extracted background language models from the content of Wikipedia in various languages. The models generated from the Simple English and English Wikipedia are compared to language models derived from other established corpora. The differences between the models with regard to term coverage, term distribution, and correlation are described and discussed. We provide access to the full dataset and create visualizations of the language models that can be used for exploratory analysis. This paper describes the newly released dataset for 33 languages and the services we provide on top of it.
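To make the comparison concrete, the following is a minimal sketch of how a unigram background language model can be built from raw text and how two such models can be compared by term coverage and rank correlation over their shared vocabulary. The function names and the whitespace tokenization are illustrative assumptions, not the paper's actual pipeline.

```python
from collections import Counter

def build_language_model(text):
    """Build a unigram background language model: term -> relative frequency.
    Assumes simple whitespace tokenization (an illustrative simplification)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}

def term_coverage(model_a, model_b):
    """Fraction of terms in model_a that also occur in model_b."""
    shared = set(model_a) & set(model_b)
    return len(shared) / len(model_a)

def rank_correlation(model_a, model_b):
    """Spearman rank correlation of term frequencies over the shared vocabulary."""
    shared = sorted(set(model_a) & set(model_b))
    n = len(shared)
    if n < 2:
        return 0.0
    def ranks(model):
        # Rank shared terms by descending frequency in this model.
        order = sorted(shared, key=lambda t: -model[t])
        return {t: i for i, t in enumerate(order)}
    ra, rb = ranks(model_a), ranks(model_b)
    d2 = sum((ra[t] - rb[t]) ** 2 for t in shared)
    return 1 - 6 * d2 / (n * (n * n - 1))
```

In a setting like the one described, `model_a` would be estimated from a Wikipedia dump and `model_b` from an established reference corpus; low coverage or low correlation would then indicate that the Wikipedia-derived model diverges from the reference.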