Compilation of a Spanish Representative Corpus

  • Authors:
  • Alexander F. Gelbukh; Grigori Sidorov; Liliana Chanona-Hernández

  • Venue:
  • CICLing '02 Proceedings of the Third International Conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2002

Abstract

Due to Zipf's law, even a very large corpus contains very few occurrences (tokens) of most of its distinct words (types). Only a corpus containing enough occurrences of even rare words can provide the statistical information needed to study the contextual usage of words. We call such a corpus representative and suggest using the Internet to compile it. We describe the corresponding algorithm and its application to Spanish, and discuss different concepts of a representative corpus.
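
The type/token distinction above can be illustrated with a minimal Python sketch. It is not the authors' compilation algorithm, which the abstract does not detail. It only measures what share of the types in a text has fewer than a chosen number of occurrences. The function names, the regular-expression tokenizer, the threshold `min_occurrences`, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter
import re


def type_frequencies(text):
    """Count the occurrences (tokens) of each distinct word (type).

    Assumption: a token is a lowercase run of word characters; a real
    corpus would likely need a proper tokenizer and lemmatizer.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


def underrepresented_share(freqs, min_occurrences):
    """Return the fraction of types with fewer than min_occurrences tokens.

    min_occurrences is a hypothetical threshold, not a value from the paper.
    """
    if not freqs:
        return 0.0
    rare = sum(1 for count in freqs.values() if count < min_occurrences)
    return rare / len(freqs)


# Toy Spanish sample (hypothetical data, far too small to be meaningful).
sample = "el gato y el perro y el pájaro ven el mar"
freqs = type_frequencies(sample)
share = underrepresented_share(freqs, min_occurrences=2)
print(f"{len(freqs)} types, share below threshold: {share:.2f}")
```

Under these assumptions, a corpus would count as closer to representative the smaller this share is for the chosen threshold. Different choices of threshold, or of what counts as a type, lead to different notions of representativeness.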