Language identification has been widely used for machine translation and information retrieval. In this paper, an improved N-grams (ING) approach is proposed for web page language identification. The improved N-grams approach combines the original N-grams (ONG) approach with a modified N-grams (MNG) approach that has been used for language identification of web documents. The features selected by the improved N-grams approach are based on N-gram frequency and N-gram position. The features selected by the original N-grams approach are based on a distance measurement, and the features selected by the modified N-grams approach are based on a Boolean matching rate, for language identification of Roman- and Arabic-script web pages. A large real-world document collection from the British Broadcasting Corporation (BBC) website, comprising 1000 documents in each of the languages (e.g., Azeri, English, Indonesian, Serbian, Somali, Spanish, Turkish, Vietnamese, Arabic, Persian, Urdu, Pashto), has been used for evaluation. Precision, recall and F1 measures have been used to determine the effectiveness of the proposed improved N-grams (ING) approach. The experiments show that the improved N-grams approach improves language identification of Roman- and Arabic-script web page documents in the available datasets.
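The abstract names two kinds of N-gram features without giving formulas: a distance measurement (ONG) and a Boolean matching rate (MNG). The Python sketch below illustrates one common reading of each, and is not the authors' implementation. It assumes character N-grams of length 1 to 3, a rank-based "out-of-place" distance as a stand-in for the distance measurement, and the fraction of matched N-grams as a stand-in for the Boolean matching rate. All function names, the `top_k` cutoff, and the toy training samples are hypothetical. The way ING combines frequency and position is not specified in the abstract, so it is not attempted here.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of lengths n_min..n_max (assumed range)."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def ranked_profile(text, top_k=300):
    """N-grams ordered by descending frequency; list index is the rank."""
    counts = Counter(char_ngrams(text))
    # Sort by (-count, ngram) so that ties are broken deterministically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ordered[:top_k]]


def out_of_place_distance(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from the language profile
    receive a fixed maximum penalty (assumed stand-in for the ONG distance)."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
               for rank, gram in enumerate(doc_profile))


def boolean_match_rate(doc_profile, lang_profile):
    """Fraction of document n-grams found in the language profile
    (assumed stand-in for the MNG Boolean matching rate)."""
    if not doc_profile:
        return 0.0
    lang_set = set(lang_profile)
    return sum(gram in lang_set for gram in doc_profile) / len(doc_profile)


def identify(text, training, use_distance=True):
    """Pick the language with the smallest distance or the highest match rate."""
    doc = ranked_profile(text)
    profiles = {lang: ranked_profile(sample) for lang, sample in training.items()}
    if use_distance:
        return min(profiles, key=lambda l: out_of_place_distance(doc, profiles[l]))
    return max(profiles, key=lambda l: boolean_match_rate(doc, profiles[l]))


def precision_recall_f1(y_true, y_pred, lang):
    """Per-language precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == lang and p == lang for t, p in pairs)
    fp = sum(t != lang and p == lang for t, p in pairs)
    fn = sum(t == lang and p != lang for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy training samples (hypothetical; the paper uses 1000 BBC documents per language).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog",
    "es": "el rápido zorro marrón salta sobre el perro perezoso",
}
```

With these toy samples, `identify(TRAINING["en"], TRAINING)` scores the English text against both profiles and selects the one with the smaller rank distance; passing `use_distance=False` switches to the match-rate criterion. A realistic setup would build each language profile from many documents rather than a single sentence.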