Evaluation of a language identification system for mono- and multilingual text documents

Authors:
Olga Artemenko;Thomas Mandl;Margaryta Shramko;Christa Womser-Hacker
Affiliations:
University of Hildesheim, Hildesheim, Germany;University of Hildesheim, Hildesheim, Germany;University of Hildesheim, Hildesheim, Germany;University of Hildesheim, Hildesheim, Germany
Venue:
Proceedings of the 2006 ACM symposium on Applied computing
Year:
2006

Abstract

Language identification is an important task in web information retrieval. This paper presents a tool for language identification in monolingual and multilingual documents. The tool implements four language identification algorithms. In addition, we present an n-gram approach for identifying the languages in multilingual documents. The tool is evaluated on monolingual texts of varying length in eight languages, including Ukrainian and Russian. The results show that n-gram-based approaches outperform word-based algorithms on short texts; on longer texts, performance is comparable. The evaluation on multilingual documents is based on real-world web documents. The tool recognizes the languages present in a document with reasonable accuracy.
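
To illustrate the general idea of n-gram-based language identification, here is a minimal sketch. The abstract does not specify the four algorithms or the multilingual method used in the tool, so this is not the authors' implementation. It assumes a generic rank-order ("out-of-place") comparison of character trigram profiles, and the names `SAMPLES`, `ngram_profile`, `out_of_place`, `identify` and `identify_segments` are hypothetical. The two training sentences are toy data chosen only for illustration, and the fixed word window is a crude stand-in for multilingual detection.

```python
from collections import Counter

# Toy training text (an assumption for illustration, not data from the paper).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze sitzt",
}


def ngram_profile(text, n=3, top_k=300):
    """Return up to top_k character n-grams, most frequent first."""
    # Normalise case and whitespace, and pad so word boundaries form n-grams.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically, so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get a fixed penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


def identify_segments(text, profiles, window=8):
    """Label consecutive fixed-size word windows, one language per window."""
    words = text.split()
    return [
        identify(" ".join(words[i:i + window]), profiles)
        for i in range(0, len(words), window)
    ]


PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

Calling `identify(some_text, PROFILES)` returns a single language code, while `identify_segments(some_text, PROFILES)` returns one code per window, and the set of those codes approximates the languages present in a mixed document. With training text this small the labels are unreliable. A realistic setup would build profiles from much larger corpora for each language.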