Language identification in multi-lingual web-documents

Authors:
Thomas Mandl; Margaryta Shramko; Olga Tartakovski; Christa Womser-Hacker
Affiliations:
Information Science, Universität Hildesheim, Hildesheim, Germany (all authors)
Venue:
NLDB'06 Proceedings of the 11th international conference on Applications of Natural Language to Information Systems
Year:
2006

Abstract

Language identification is an important task for web information retrieval. This paper presents the implementation of a tool for language identification in mono- and multi-lingual documents. The tool implements four algorithms for language identification. Furthermore, we present an n-gram approach for identifying the languages in multi-lingual documents. We evaluate the tool on monolingual texts of varied length and report results for eight languages, including Ukrainian and Russian. The results show that n-gram-based approaches outperform word-based algorithms for short texts; for longer texts, the performance is comparable. The evaluation for multi-lingual documents is based on both short synthetic documents and real-world web documents. Our tool is able to recognize the languages present as well as the location of the language change with reasonable accuracy.
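
To make the n-gram idea concrete, the following is a minimal sketch in Python and not the tool described in the paper. The abstract does not specify the four algorithms, so everything here is an assumption made for illustration: character trigram profiles, cosine similarity as the score, two toy training sentences (`SAMPLES`), and fixed-size word chunks as a crude way of locating a language change. The names `ngrams`, `cosine`, `identify`, and `segment` are hypothetical.

```python
import math
from collections import Counter

# Toy training data (assumption): a real system would build its
# profiles from large corpora for each supported language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is "
          "the language that we want to find in the text",
    "de": "der schnelle braune fuchs springt über den faulen hund und "
          "das ist die sprache die wir in dem text finden wollen",
}


def ngrams(text, n=3):
    """Count character n-grams of the lower-cased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


PROFILES = {lang: ngrams(sample) for lang, sample in SAMPLES.items()}


def identify(text):
    """Return the language whose profile is most similar to the text."""
    grams = ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))


def segment(text, chunk=3):
    """Label chunks of `chunk` words and merge equal neighbours.

    Returns (language, start_word, end_word) tuples; a language change
    is reported wherever one segment ends and the next begins.
    """
    words = text.split()
    segments = []
    for start in range(0, len(words), chunk):
        end = min(start + chunk, len(words))
        lang = identify(" ".join(words[start:end]))
        if segments and segments[-1][0] == lang:
            segments[-1] = (lang, segments[-1][1], end)
        else:
            segments.append((lang, start, end))
    return segments


MIXED = "we want to find the language wir wollen die sprache in dem"
print(segment(MIXED))
```

In this sketch the chunk size trades off the two effects the abstract mentions: shorter chunks locate the language change more precisely, but each chunk then gives the classifier less text to work with, which is where n-gram statistics are reported to hold up better than word-based ones.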