Turkish Word N-gram Analyzing Algorithms for a Large Scale Turkish Corpus - TurCo

  • Authors:
  • Yalçin Çebi; Gökhan Dalkiliç

  • Affiliations:
  • -;-

  • Venue:
  • ITCC '04: Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04), Volume 2
  • Year:
  • 2004

Abstract

To calculate statistical properties of a language, one first needs samples of that language; such a sample is called a corpus. An unbalanced large-scale Turkish text corpus (TurCo) of ~362 MB and more than 50 million words was prepared from 12 different resources, including web sites and novels in the Turkish language. Different algorithms were tested to obtain the n-gram (1 ≤ n ≤ 5) values. The efficiency of each algorithm was examined by applying it to each piece of the corpus in turn. Only detailed results of the two algorithms implemented without database tables are given, because all the other algorithms needed more than one day to run, which made those tests impractical.
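
The core task described in the abstract, counting contiguous word n-grams for 1 ≤ n ≤ 5, can be illustrated with a minimal in-memory sketch. This is not the authors' implementation (which targeted a ~362 MB corpus and compared variants with and without database tables); the helper name count_word_ngrams and the toy token list below are illustrative assumptions.

```python
from collections import Counter

def count_word_ngrams(tokens: list[str], n: int) -> Counter:
    """Count contiguous word n-grams of length n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy example: unigrams through 5-grams over a small tokenized sample.
sample = "bir iki üç dört beş altı yedi".split()
for n in range(1, 6):
    counts = count_word_ngrams(sample, n)
    print(n, counts.most_common(3))
```

For a corpus of more than 50 million words, a single in-memory pass like this would be memory-bound, which is why the paper compares different algorithms and processes the corpus piece by piece.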