Disentangling from Babylonian confusion – unsupervised language identification
CICLing'05 Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing
While many effective and efficient algorithms exist, most of them based on supervised n-gram or word dictionary methods, we propose a semi-supervised approach to language identification based on prototype semantics. Our method is aimed primarily at noise-rich environments in which only very small text fragments are available for analysis and no training data exists, and it extends even to analyzing the probable language affiliations of single words. We have integrated our prototype system into a larger web crawling and information management architecture and evaluated it in an experimental setup comprising datasets in 11 European languages.