Evaluation of a language identification system for mono- and multilingual text documents

  • Authors:
  • Olga Artemenko; Thomas Mandl; Margaryta Shramko; Christa Womser-Hacker

  • Affiliations:
  • University of Hildesheim, Hildesheim, Germany (all authors)

  • Venue:
  • Proceedings of the 2006 ACM symposium on Applied computing
  • Year:
  • 2006

Abstract

Language identification is an important task for web information retrieval. This paper presents a tool for language identification in mono- and multilingual documents. The tool implements four language identification algorithms. In addition, we present an n-gram approach for identifying the languages in multilingual documents. The tool is evaluated on monolingual texts of varying length in eight languages, including Ukrainian and Russian. The results show that n-gram-based approaches outperform word-based algorithms on short texts, while for longer texts their performance is comparable. The evaluation on multilingual documents is based on real-world web documents. The tool recognizes the languages present in a document with reasonable accuracy.
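
The abstract does not say which n-gram method the tool uses. As a rough illustration of the general idea only, the sketch below ranks character trigrams by frequency and compares rank profiles with an out-of-place distance. This is a common textbook technique and is assumed here rather than taken from the paper. The function names, the toy training sentences, and the parameters n and top_k are all hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first."""
    # Normalise case and whitespace, and pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    # Sort by descending count, then alphabetically, so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify(text, lang_profiles, n=3):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text, n)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


# Toy training data (assumption): real profiles would be built from large corpora.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

guess = identify("the dog sleeps", PROFILES)
```

The toy profiles are far too small for real use; they only keep the sketch self-contained.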