Lessons from building a Persian written corpus: Peykare

  • Authors:
  • Mahmood Bijankhan; Javad Sheykhzadegan; Mohammad Bahrani; Masood Ghayoomi

  • Affiliations:
  • Department of Linguistics, The University of Tehran, Tehran, Iran; Research Center for Intelligent Signal Processing, Tehran, Iran; Computer Engineering Department, Sharif University of Technology, Tehran, Iran; German Grammar Group, Freie Universität Berlin, Berlin, Germany

  • Venue:
  • Language Resources and Evaluation
  • Year:
  • 2011

Abstract

This paper discusses some of the lessons learned while building "Peykare", a written language resource for contemporary Persian. After defining five linguistic varieties and 24 registers based on them, we collected texts for Peykare to support linguistic analysis, including the study of cross-register differences. For the tokenization of Persian, we propose a descriptive generalization for normalizing the orthographic variation found in texts. Peykare is annotated according to the EAGLES guidelines, which yield a hierarchy of part-of-speech tags, and the annotation is carried out with a semi-automatic approach. We also pay special attention to the Ezafe construction and to homographs, both of which are important in Persian text analysis.