Disentangling from Babylonian confusion – unsupervised language identification

Authors:
Chris Biemann; Sven Teresniak
Affiliations:
Computer Science Institute, NLP Dept., Leipzig University, Leipzig, Germany; Computer Science Institute, NLP Dept., Leipzig University, Leipzig, Germany
Venue:
CICLing'05 Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2005

Abstract

This work presents an unsupervised solution to language identification. The method sorts the sentences of a multilingual text corpus into the languages it contains and makes no assumptions about the number or size of the monolingual fractions. Evaluations on seven-language and bilingual corpora show that classification quality is comparable to that of supervised approaches and that the method works almost error-free once at least 100 sentences per language are available.