Language identification is an important task for web information retrieval. This paper presents the implementation of a tool for language identification in monolingual and multilingual documents. The tool implements four algorithms for language identification. Furthermore, we present an n-gram approach for identifying the languages in multilingual documents. The tool is evaluated on monolingual texts of varied length in eight languages, including Ukrainian and Russian. The results show that n-gram-based approaches outperform word-based algorithms for short texts, while for longer texts the performance is comparable. The evaluation for multilingual documents is based on real-world web documents. Our tool recognizes the languages present in a document with reasonable accuracy.
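To make the n-gram idea concrete, the following is a minimal sketch and not the tool's actual implementation, since the abstract does not specify its four algorithms. It assumes a simple trigram-overlap score: each language is represented by the set of character trigrams seen in a training sample, and a text is assigned to the language whose set covers the largest share of the text's trigrams. The training sentences, the function names (`trigrams`, `score`, `identify`, `languages_in`), and the line-by-line segmentation used for multilingual input are all illustrative assumptions.

```python
from collections import Counter

# Tiny illustrative training samples (assumption: a real system would
# build its profiles from far larger corpora, and for more languages).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog runs away",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann läuft der hund weg",
    "ru": "быстрая коричневая лиса прыгает через ленивую собаку и потом собака убегает",
}


def trigrams(text):
    """Count character trigrams of the lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# One profile per language: the set of trigrams seen in its sample.
PROFILES = {lang: set(trigrams(sample)) for lang, sample in SAMPLES.items()}


def score(text, profile):
    """Fraction of the text's trigram occurrences that appear in the profile."""
    grams = trigrams(text)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    return sum(count for gram, count in grams.items() if gram in profile) / total


def identify(text):
    """Return the language whose profile covers the largest share of trigrams."""
    return max(PROFILES, key=lambda lang: score(text, PROFILES[lang]))


def languages_in(document):
    """Label each non-empty line separately (assumed segmentation strategy)."""
    return [identify(line) for line in document.splitlines() if line.strip()]
```

Under these assumptions, `languages_in("the lazy dog\nden faulen hund\nленивую собаку")` returns `['en', 'de', 'ru']`. Splitting on lines is only one possible way to segment a multilingual document; sliding windows or sentence boundaries would fit the same scoring function.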