Language Identification on the Web: Extending the Dictionary Method

Authors:
Radim Řehůřek; Milan Kolkus
Affiliations:
Masaryk University in Brno; Seznam.cz, a.s.
Venue:
CICLing '09 Proceedings of the 10th International Conference on Computational Linguistics and Intelligent Text Processing
Year:
2009


Abstract

Automated language identification of written text is a well-established research domain that has received considerable attention in the past. By now, efficient and effective algorithms based on character n-grams are in use, mainly with identification based on Markov models or on character n-gram profiles. In this paper we investigate the limitations of these approaches when applied to real-world web pages. The challenges to be overcome include language identification on very short texts, correctly handling texts of unknown language and texts comprised of multiple languages. We propose and evaluate a new method, which constructs language models based on word relevance and addresses these limitations. We also extend our method to allow us to efficiently and automatically segment the input text into blocks of individual languages, in case of multiple-language documents.
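
The abstract does not spell out how word relevance is computed, so the following is only a minimal illustrative sketch of a dictionary-style identifier, not the authors' method. It assumes a simplified relevance definition (a word's relative frequency in one language divided by the sum of its relative frequencies over all languages), toy corpora, and a hypothetical threshold below which a text is reported as "unknown". The names CORPORA, build_relevance, and identify are invented for this example.

```python
from collections import Counter

# Toy training corpora (hypothetical; a real model would use large corpora).
CORPORA = {
    "en": "the cat sat on the mat and the dog ran to the house",
    "cs": "pes a kočka jsou na zahradě a dům je na kopci",
    "de": "der hund und die katze sind in dem haus und der garten ist groß",
}


def build_relevance(corpora):
    """Return {language: {word: relevance}}.

    Assumed definition: relevance is the word's relative frequency in the
    language divided by the sum of its relative frequencies in all languages,
    so a word seen in only one language gets relevance 1.0.
    """
    rel_freq = {}
    for lang, text in corpora.items():
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        rel_freq[lang] = {w: c / total for w, c in counts.items()}

    relevance = {}
    for lang, freqs in rel_freq.items():
        relevance[lang] = {
            w: f / sum(other.get(w, 0.0) for other in rel_freq.values())
            for w, f in freqs.items()
        }
    return relevance


def identify(text, relevance, threshold=0.5):
    """Return the best-scoring language, or "unknown" if no language
    reaches the (assumed) threshold on mean word relevance."""
    tokens = text.lower().split()
    if not tokens:
        return "unknown"
    scores = {
        lang: sum(words.get(t, 0.0) for t in tokens) / len(tokens)
        for lang, words in relevance.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "unknown"


RELEVANCE = build_relevance(CORPORA)
print(identify("the dog and the cat", RELEVANCE))
print(identify("pes je na zahradě", RELEVANCE))
print(identify("xyzzy qwerty", RELEVANCE))
```

Because the score is an average over words rather than over character n-grams, a text made of out-of-vocabulary tokens scores zero for every language and falls under the threshold, which is one simple way to represent the "unknown language" case the abstract mentions. Segmenting multilingual text would need an additional step, such as scoring a sliding window of words, which this sketch does not attempt.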