Foundations of statistical natural language processing
Using register-diversified corpora for general language studies
Computational Linguistics - Special issue on using large corpora: II
Morphological tagging: data vs. dictionaries
NAACL 2000 Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference
Issues in Arabic orthography and morphology analysis
Semitic '04 Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages
Challenges in Developing Persian Corpora from Online Resources
IALP '09 Proceedings of the 2009 International Conference on Asian Language Processing
This paper discusses lessons learned while building 'Peykare', a written language resource for contemporary Persian. After defining five linguistic varieties and 24 registers based on them, we collected texts for Peykare to support linguistic analysis, including the study of cross-register differences. For the tokenization of Persian, we propose a descriptive generalization that normalizes the orthographic variations found in texts. To annotate Peykare, we follow the EAGLES guidelines, which yield a hierarchical part-of-speech tagset, and we apply a semi-automatic annotation procedure. We also pay special attention to the Ezafe construction and to homographs, both of which are important in Persian text analysis.