Turkish Word N-gram Analyzing Algorithms for a Large Scale Turkish Corpus - TurCo

  • Authors:
  • Yalçin Çebi; Gökhan Dalkiliç

  • Affiliations:
  • -;-

  • Venue:
  • ITCC '04: Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04), Volume 2
  • Year:
  • 2004

Abstract

To calculate statistical properties of a language, one first needs samples of that language; such a sample is called a corpus. An unbalanced large-scale Turkish text corpus (TurCo) of ~362 MB and more than 50 million words was prepared from 12 different resources, including web sites and novels in the Turkish language. Different algorithms were tested to obtain the n-gram (1 ≤ n ≤ 5) values. The efficiency of each algorithm was examined by applying it to each piece of the corpus in turn. Only detailed results of the two algorithms implemented without database tables are given, because all the other algorithms needed more than one day to run, which made those tests impractical.
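
The core task described in the abstract, counting contiguous word n-grams for 1 ≤ n ≤ 5, can be illustrated with a minimal in-memory sketch. This is not the authors' implementation (which targeted a ~362 MB corpus and compared variants with and without database tables); the helper name count_word_ngrams and the toy token list below are illustrative assumptions.

```python
from collections import Counter

def count_word_ngrams(tokens: list[str], n: int) -> Counter:
    """Count contiguous word n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy example: unigrams through 5-grams over a small tokenized sample.
sample = "bir iki üç dört beş altı yedi".split()
for n in range(1, 6):
    counts = count_word_ngrams(sample, n)
    print(n, counts.most_common(3))
```

For a corpus of more than 50 million words, a single in-memory pass like this would be memory-bound, which is why the paper compares different algorithms and processes the corpus piece by piece.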