Lessons from building a Persian written corpus: Peykare

Authors:
Mahmood Bijankhan; Javad Sheykhzadegan; Mohammad Bahrani; Masood Ghayoomi
Affiliations:
Department of Linguistics, The University of Tehran, Tehran, Iran; Research Center for Intelligent Signal Processing, Tehran, Iran; Computer Engineering Department, Sharif University of Technology, Tehran, Iran; German Grammar Group, Freie Universität Berlin, Berlin, Germany
Venue:
Language Resources and Evaluation
Year:
2011

Abstract

This paper presents some of the lessons learned while building 'Peykare', a written language resource for contemporary Persian. We first defined five linguistic varieties and, based on them, 24 registers, and then collected texts for Peykare to support linguistic analysis, including the study of cross-register differences. For the tokenization of Persian, we propose a descriptive generalization for normalizing the orthographic variation found in texts. To annotate Peykare, we follow the EAGLES guidelines, which yield a hierarchy of part-of-speech tags, and we carry out the annotation with a semi-automatic approach. We also pay special attention to the Ezafe construction and to homographs, both of which are important in Persian text analysis.