Challenges in Developing Persian Corpora from Online Resources

  • Authors:
  • Masood Ghayoomi; Saeedeh Momtazi

  • Affiliations:
  • -;-

  • Venue:
  • IALP '09 Proceedings of the 2009 International Conference on Asian Language Processing
  • Year:
  • 2009

Abstract

Persian is an Indo-European language that borrowed its script from Arabic, a member of the Semitic language family. Because the Persian and Arabic scripts are so similar, problems arise when processing electronic text. This paper discusses some common problems encountered in practice while developing a Persian corpus from online materials. The sources of these problems are the Persian script itself; its mixture with the Arabic script; Persian orthography; typists’ typing styles; and the mixing of Persian and Arabic code pages in operating systems.
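
The mixture of Arabic and Persian characters mentioned in the abstract can be illustrated with a small sketch. The abstract does not say which characters are affected or how the authors handle them, so the following is an assumption for illustration only: it uses two well-known look-alike pairs in Unicode (Arabic Yeh U+064A versus Farsi Yeh U+06CC, and Arabic Kaf U+0643 versus Keheh U+06A9). The names `ARABIC_TO_PERSIAN` and `normalize_persian` are hypothetical.

```python
# Illustrative sketch only; not the method described in the paper.
# Assumption: Arabic-keyboard variants of Yeh and Kaf should be mapped
# to the code points conventionally used for Persian text.
ARABIC_TO_PERSIAN = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

# Precompute a translation table for str.translate.
_TABLE = str.maketrans(ARABIC_TO_PERSIAN)


def normalize_persian(text: str) -> str:
    """Replace the Arabic variants listed above with their Persian forms."""
    return text.translate(_TABLE)


def count_arabic_variants(text: str) -> int:
    """Count characters in the text that the mapping above would change."""
    return sum(1 for ch in text if ch in ARABIC_TO_PERSIAN)
```

Running `count_arabic_variants` over raw web text before normalizing gives a rough sense of how often the two code pages are mixed in a given source. A real corpus pipeline would likely need a longer mapping and would also have to address orthography and typing style, which this sketch does not cover.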