Automated language identification of written text is a well-established research domain that has received considerable attention. Efficient and effective algorithms based on character n-grams are now in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges include identifying the language of very short texts and correctly handling texts of unknown language and texts composed of multiple languages. We propose and evaluate a new method that constructs language models based on word relevance and addresses these limitations. We also extend the method to segment the input text efficiently and automatically into blocks of individual languages in the case of multiple-language documents.
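For orientation, the character n-gram profile baseline mentioned above can be sketched roughly as follows. This is a minimal illustration of a rank-order ("out-of-place") profile comparison, not the word-relevance method proposed in the paper. The function names, the `top_k` cutoff, and the two toy training sentences are assumptions made up for this example.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams (n = 1..n_max) to its rank."""
    # Normalise case and whitespace, and pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_k])}


def out_of_place(doc_profile, lang_profile, top_k=300):
    """Sum of rank differences; an n-gram absent from the language profile costs top_k."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else top_k
        for gram, rank in doc_profile.items()
    )


def identify(text, lang_profiles, top_k=300):
    """Return the language whose profile is closest to the profile of the text."""
    doc = ngram_profile(text, top_k=top_k)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc, lang_profiles[lang], top_k),
    )


# Toy training data (hypothetical); real profiles need far larger corpora.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog "
          "runs away with the other dogs",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "läuft der hund mit den anderen hunden weg",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(identify("the dog and the fox", PROFILES))
```

Note that `identify` always returns one of the trained languages. That forced choice is the kind of limitation the abstract points to for short texts, texts of unknown language, and multilingual documents.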