Word Statistics of Turkish Language on a Large Scale Text Corpus - TurCo

Authors:
Gökhan Dalkiliç;Yalçin Çebi
Affiliations:
-;-
Venue:
ITCC '04 Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04) Volume 2
Year:
2004

Citing 0
Cited 1

Zipf's law and Mandelbrot's constants for Turkish language using Turkish Corpus (TurCo)

ADVIS'04 Proceedings of the Third international conference on Advances in Information Systems

Abstract

Determining the statistical properties of a natural language is one of the most important parts of language analysis. The Number of Different Words (NODW) and the Different Word Usage Ratio (DWUR) are among the general characteristics of a corpus. These values are described and calculated for the Turkish Corpus (TurCo). Word n-grams are also calculated for Turkish; this was done for English years ago but could not be done for Turkish because of the lack of a large-scale corpus. The n-gram results were compared with those of the Brown corpus (a well-known corpus of English), and the similarity between TurCo and the Brown corpus was examined.
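
The quantities named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the whitespace tokenizer, and the sample sentence are hypothetical, and the abstract does not give the DWUR formula, so the ratio used here (different words divided by total words) is only an assumed reading of the term.

```python
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenization (an assumption; the abstract does not
    # describe how TurCo was tokenized). Case is left untouched because
    # str.lower() does not follow Turkish dotted/dotless i rules.
    return text.split()


def corpus_stats(tokens):
    # Returns (total words, number of different words, assumed ratio).
    total = len(tokens)
    nodw = len(set(tokens))
    # ASSUMPTION: ratio taken as NODW / total; the paper may define DWUR differently.
    ratio = nodw / total if total else 0.0
    return total, nodw, ratio


def word_ngrams(tokens, n):
    # Count every run of n consecutive words.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


sample_tokens = tokenize("bir elma bir armut bir elma")
sample_stats = corpus_stats(sample_tokens)
sample_bigrams = word_ngrams(sample_tokens, 2)
```

On a real corpus the same counters would be filled by streaming over files rather than holding one string in memory; the sketch only shows what is being counted.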