Letter based text scoring method for language identification

  • Authors:
  • Hidayet Takcı; İbrahim Soğukpınar

  • Affiliations:
  • Gebze Institute of Technology, Gebze/Kocaeli; Gebze Institute of Technology, Gebze/Kocaeli

  • Venue:
  • ADVIS'04 Proceedings of the Third International Conference on Advances in Information Systems
  • Year:
  • 2004

Abstract

In recent years, the volume of text documents on the internet, intranets, digital libraries, and newsgroups has grown at an unexpected rate. Obtaining useful information and meaningful patterns from these documents is an important issue. Identifying the language of these text documents is an important problem that has been studied by many researchers. These studies have generally used words (terms) for language identification, following different approaches such as linguistic and statistical ones. In this work, a Letter Based Text Scoring Method is proposed for language identification. The method is based on the letter distributions of texts. Text scoring is performed to identify the language of each text document, and text scores are calculated using the letter distribution of the new text document. Besides its acceptable accuracy, the proposed method is easier and faster than the short terms and n-gram methods.
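
The abstract does not give the scoring formula, so the following Python sketch is only one plausible reading of letter-based text scoring, not the authors' implementation. It assumes that each language is represented by a profile of relative letter frequencies, and that a text's score for a language is the sum, over the letters in the text, of the text's own letter frequency multiplied by the profile frequency. The function names `letter_distribution`, `score_text`, and `identify_language` are hypothetical.

```python
from collections import Counter


def letter_distribution(text):
    """Return the relative frequency of each letter in `text`.

    Non-letter characters (digits, spaces, punctuation) are ignored.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    counts = Counter(letters)
    return {ch: n / total for ch, n in counts.items()}


def score_text(text, profile):
    """Score `text` against one language profile.

    Assumption (not stated in the abstract): the score is the dot product
    of the text's letter distribution and the language's letter profile.
    """
    dist = letter_distribution(text)
    return sum(freq * profile.get(ch, 0.0) for ch, freq in dist.items())


def identify_language(text, profiles):
    """Return (best_language, scores) for `text`.

    `profiles` maps a language name to its letter-frequency profile,
    e.g. one built with `letter_distribution` from training text.
    """
    scores = {lang: score_text(text, prof) for lang, prof in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

In use, a profile would be built once per language by calling `letter_distribution` on a training corpus, and `identify_language` would then be called on each new document. Because only single-letter counts are needed, the work per document is a single pass over its characters, which is consistent with the abstract's claim that the approach is simpler than word-based or n-gram methods.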