Improved N-grams approach for web page language identification

  • Authors:
  • Ali Selamat

  • Affiliations:
  • Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia, Johor, Malaysia

  • Venue:
  • Transactions on Computational Collective Intelligence V
  • Year:
  • 2011

Abstract

Language identification is widely used in machine translation and information retrieval. In this paper, an improved N-grams (ING) approach is proposed for web page language identification. The ING approach combines the original N-grams (ONG) approach and a modified N-grams (MNG) approach, both of which have been used for language identification of web documents. Features in the ING approach are selected based on N-gram frequency and N-gram position. Features in the ONG approach are selected based on a distance measurement, while features in the MNG approach are selected based on a Boolean matching rate, for language identification of web pages in Roman and Arabic scripts. A large real-world document collection from the British Broadcasting Corporation (BBC) website, composed of 1000 documents for each of the languages (e.g., Azeri, English, Indonesian, Serbian, Somali, Spanish, Turkish, Vietnamese, Arabic, Persian, Urdu, Pashto), has been used for evaluation. Precision, recall and F1 measures have been used to determine the effectiveness of the proposed ING approach. The experiments show that the ING approach improves language identification of Roman and Arabic script web page documents in the available datasets.
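The abstract names two scoring ideas without giving formulas: a distance measurement (ONG) and a Boolean matching rate (MNG). The following is a minimal sketch of how such scores are commonly computed over character N-gram profiles, not the paper's implementation. It assumes a rank-based "out-of-place" distance for the distance measurement and a simple fraction of shared N-grams for the matching rate; the function names, the `n_max` and `top_k` parameters, and the toy training strings are all hypothetical.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Return the top_k most frequent character n-grams (lengths 1..n_max),
    most frequent first. n_max and top_k are illustrative defaults."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place_distance(doc_profile, lang_profile):
    """Assumed distance measurement: sum of rank differences between the two
    profiles; n-grams absent from lang_profile get a fixed maximum penalty."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def boolean_matching_rate(doc_profile, lang_profile):
    """Assumed Boolean matching rate: fraction of the document's n-grams
    that also occur in the language profile (rank is ignored)."""
    if not doc_profile:
        return 0.0
    lang_set = set(lang_profile)
    return sum(gram in lang_set for gram in doc_profile) / len(doc_profile)


def identify(text, lang_profiles):
    """Return (language with smallest distance, language with highest match rate)."""
    doc = ngram_profile(text)
    by_distance = min(
        lang_profiles, key=lambda lang: out_of_place_distance(doc, lang_profiles[lang])
    )
    by_match = max(
        lang_profiles, key=lambda lang: boolean_matching_rate(doc, lang_profiles[lang])
    )
    return by_distance, by_match


if __name__ == "__main__":
    # Toy training text; a real profile would be built from many documents.
    profiles = {
        "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
        "es": ngram_profile("el rápido zorro marrón salta sobre el perro perezoso"),
    }
    print(identify("the dog jumps", profiles))
```

The distance score is sensitive to how frequently each N-gram occurs (through its rank), whereas the matching rate only checks presence. That difference is one plausible reason to combine frequency and position information, as the ING approach is described as doing.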