Word Statistics of Turkish Language on a Large Scale Text Corpus - TurCo

  • Authors:
  • Gökhan Dalkiliç;Yalçin Çebi

  • Venue:
  • ITCC '04 Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04) Volume 2
  • Year:
  • 2004

Abstract

Determining the statistical properties of a natural language is one of the most important parts of language analysis. The Number of Different Words (NODW) and the Different Word Usage Ratio (DWUR) are among the general characteristics of a corpus. These values are described and calculated for the Turkish Corpus (TurCo). Word n-grams are also calculated for Turkish; this was done for English years ago but could not be done for Turkish because no large-scale corpus was available. The n-gram results were compared with those of the Brown corpus (a well-known corpus for English), and the similarity between TurCo and the Brown corpus was examined.
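
As a rough illustration of the quantities named in the abstract, the following Python sketch counts distinct words and word n-grams over a toy sentence. It is not the authors' implementation. The abstract does not give a formula for DWUR, so the ratio used here (distinct words divided by total words) is an assumption, and the function names `tokenize`, `word_stats`, and `word_ngrams` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens.

    Naive sketch: uses Python's default, locale-independent lowercasing,
    which does not handle the Turkish dotted/dotless I correctly.
    """
    return re.findall(r"\w+", text.lower())


def word_stats(tokens):
    """Return (NODW, DWUR) for a list of tokens.

    NODW is the number of distinct word forms.
    ASSUMPTION: DWUR is taken to be NODW divided by the total number of
    tokens; the abstract does not state the paper's exact definition.
    """
    nodw = len(set(tokens))
    dwur = nodw / len(tokens) if tokens else 0.0
    return nodw, dwur


def word_ngrams(tokens, n):
    """Count word n-grams as tuples of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Toy input only; the real TurCo corpus is not reproduced here.
sample = "bir elma bir armut bir elma daha"
tokens = tokenize(sample)
nodw, dwur = word_stats(tokens)
bigrams = word_ngrams(tokens, 2)
```

On a real corpus the same counters would be fed token by token from disk rather than from a single in-memory string, and the tokenizer would need Turkish-aware case folding.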