Drive-by language identification: a byproduct of applied prototype semantics

Authors:
Ronald Winnemöller
Affiliations:
Regional Computer Centre, University of Hamburg, Hamburg
Venue:
CICLing'10 Proceedings of the 11th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2010

Abstract

Many effective and efficient algorithms for language identification exist, most of them based on supervised n-gram or word dictionary methods. In contrast, we propose a semi-supervised approach based on prototype semantics. Our method is aimed primarily at noise-rich environments in which only very small text fragments are available for analysis and no training data exists; it extends even to estimating the probable language affiliations of single words. We have integrated our prototype system into a larger web crawling and information management architecture and evaluated the prototype against an experimental setup comprising datasets in 11 European languages.