Improved N-grams approach for web page language identification

Authors:
Ali Selamat
Affiliations:
Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia, Johor, Malaysia
Venue:
Transactions on Computational Collective Intelligence V
Year:
2011

Abstract

Language identification is widely used in machine translation and information retrieval. In this paper, an improved N-grams (ING) approach is proposed for web page language identification. The ING approach combines the original N-grams (ONG) approach with a modified N-grams (MNG) approach that has been used for language identification of web documents. The features selected by the ING approach are based on N-gram frequency and N-gram position. The features selected by the ONG approach are based on a distance measurement, and the features selected by the MNG approach are based on a Boolean matching rate, for language identification of Roman- and Arabic-script web pages. A large real-world document collection from the British Broadcasting Corporation (BBC) website, composed of 1000 documents in each of the languages (e.g., Azeri, English, Indonesian, Serbian, Somali, Spanish, Turkish, Vietnamese, Arabic, Persian, Urdu, Pashto), has been used for evaluation. Precision, recall, and F1 measures have been used to determine the effectiveness of the proposed ING approach. The experiments show that the ING approach improves language identification of Roman- and Arabic-script web page documents in the available datasets.
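
The two baseline ideas named in the abstract, a distance measurement over N-gram profiles and a Boolean matching rate, can be illustrated with a minimal Python sketch. This is not the paper's implementation. The abstract does not give the exact formulas, so the sketch assumes the classic rank-order "out-of-place" distance for the ONG side and a simple fraction of matched N-grams for the MNG side. All function names, the padding character, and the `n_max` and `top_k` parameters are hypothetical choices made for illustration.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (n = 1..n_max),
    ordered from most to least frequent (ties broken alphabetically)."""
    # Assumption: words are lowercased and joined with "_" as a boundary marker.
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(doc_profile, lang_profile):
    """Rank-based distance (assumed stand-in for the ONG distance measurement).
    N-grams missing from the language profile receive a maximum penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def boolean_matching_rate(doc_profile, lang_profile):
    """Fraction of document n-grams found in the language profile
    (assumed stand-in for the MNG Boolean matching rate)."""
    if not doc_profile:
        return 0.0
    lang_set = set(lang_profile)
    return sum(gram in lang_set for gram in doc_profile) / len(doc_profile)


def identify(text, training):
    """Return (language with the smallest distance, language with the highest matching rate)."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}
    by_distance = min(profiles, key=lambda l: out_of_place_distance(doc, profiles[l]))
    by_match = max(profiles, key=lambda l: boolean_matching_rate(doc, profiles[l]))
    return by_distance, by_match
```

Here `training` is assumed to be a dictionary mapping a language name to sample text in that language. Real profiles would be built from many documents per language, as in the BBC collection described above. A combined approach in the spirit of ING would additionally use N-gram frequency and position, which this sketch leaves out.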