Zipf's Law and Mandelbrot's Constants for Turkish Language Using Turkish Corpus (TurCo)
ADVIS'04 Proceedings of the Third international conference on Advances in Information Systems
To calculate statistical properties of a language, one first needs to take samples of that language; such a sample is called a corpus. An unbalanced, large-scale Turkish text corpus (TurCo) of about 362 MB and more than 50 million words was prepared from 12 different resources, including web sites and novels in Turkish. Different algorithms were tested to obtain the n-gram (1 ≤ n ≤ 5) values. The efficiency of the algorithms was examined by applying them to each piece of the corpus one by one. Detailed results are given only for the two algorithms that do not use database tables, because all the other algorithms need more than one day to run, which makes those tests inefficient.
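
As an illustration of what counting n-grams for 1 ≤ n ≤ 5 without database tables might look like, the following is a minimal in-memory sketch. It is not the authors' algorithm. It assumes word-level n-grams over whitespace-separated tokens (the abstract does not say whether the n-grams are over words or characters), and the function name `count_ngrams` and the sample sentence are hypothetical.

```python
from collections import Counter


def count_ngrams(tokens, max_n=5):
    """Count n-grams for every n in 1..max_n using in-memory counters.

    Assumption: tokens is a list of words; an n-gram is a tuple of n
    consecutive tokens. No database tables are involved.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical toy input, not taken from TurCo.
sample = "bir iki bir iki üç".split()
ngrams = count_ngrams(sample)
```

For a corpus of tens of millions of words, holding every 5-gram in memory may not be feasible, so a real run would likely process the corpus piece by piece, as the abstract describes, and merge the per-piece counters.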