Disentangling from Babylonian confusion – unsupervised language identification

  • Authors:
  • Chris Biemann;Sven Teresniak

  • Affiliations:
  • Computer Science Institute, NLP Dept., Leipzig University, Leipzig, Germany;Computer Science Institute, NLP Dept., Leipzig University, Leipzig, Germany

  • Venue:
  • CICLing'05 Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2005

Abstract

This work presents an unsupervised solution to language identification. The method sorts the sentences of a multilingual text corpus into the languages the corpus contains and makes no assumptions about the number or size of the monolingual fractions. Evaluation on 7-lingual and bilingual corpora shows that classification quality is comparable to that of supervised approaches and that the method works almost error-free given 100 or more sentences per language.
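
The abstract does not spell out the algorithm, so the following is only a toy sketch of the general idea of sorting sentences into languages without labels or a preset number of languages. It is not the authors' method. It assumes whitespace tokenization and simply groups sentences that are connected through shared words (a connected-components grouping via union-find). The function name `cluster_sentences` and the sample corpus are hypothetical.

```python
from collections import defaultdict


def cluster_sentences(sentences):
    """Group sentences linked by shared words (toy stand-in, not the paper's algorithm)."""
    parent = {}

    def find(word):
        # Walk up to the root representative, shortening the path as we go.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]
            word = parent[word]
        return word

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Assumption: lowercase whitespace tokenization is good enough for the sketch.
    tokenized = [s.lower().split() for s in sentences]

    # Words that occur in the same sentence end up in the same component.
    for words in tokenized:
        for w in words[1:]:
            union(words[0], w)

    # A sentence belongs to the component of its words; the number of
    # groups is not fixed in advance.
    groups = defaultdict(list)
    for sentence, words in zip(sentences, tokenized):
        if words:
            groups[find(words[0])].append(sentence)
    return list(groups.values())


corpus = [
    "the cat sleeps on the sofa",
    "der hund spielt im garten",
    "the dog sleeps in the garden",
    "die katze spielt im haus",
]
clusters = cluster_sentences(corpus)
for cluster in clusters:
    print(cluster)
```

On this sample the English and German sentences share no tokens, so they fall into separate groups. A single token shared across languages (a name or a number, say) would merge two groups in this sketch, which is why a practical approach would need weighted evidence rather than plain connectivity.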