Creating a Persian-English comparable corpus

  • Authors:
  • Homa Baradaran Hashemi;Azadeh Shakery;Heshaam Faili

  • Affiliations:
  • School of Electrical and Computer Engineering, College of Engineering, University of Tehran;School of Electrical and Computer Engineering, College of Engineering, University of Tehran;School of Electrical and Computer Engineering, College of Engineering, University of Tehran

  • Venue:
  • CLEF'10 Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum
  • Year:
  • 2010

Abstract

Multilingual corpora are valuable resources for cross-language information retrieval and are available for many language pairs. However, Persian lacks rich multilingual resources, owing to some special features of the language and to the difficulty of constructing such corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in English and Hamshahri news in Persian. We align the documents in these collections using the similarity of their topics and their publication dates. We tried several alternatives for constructing the comparable corpus and assessed the quality of the resulting corpora using different criteria. The evaluation results show that the aligned documents are of high quality, and using the Persian-English comparable corpus to extract translation knowledge seems promising.
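
The abstract does not specify how topic similarity is computed or how close two publication dates must be, so the following is only a minimal sketch of date-constrained topic alignment under stated assumptions. It assumes each document has already been reduced to a list of topic terms in a shared vocabulary (for the Persian side, this implies some translation step that is not shown), uses cosine similarity over term counts, and treats the function name `align` and the parameters `max_days` and `min_sim` as hypothetical choices rather than the authors' settings.

```python
from collections import Counter
from datetime import date
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def align(english_docs, persian_docs, max_days=1, min_sim=0.3):
    """Pair each English document with its most similar Persian document.

    Each document is a (doc_id, publication_date, terms) tuple. Persian terms
    are assumed to be already mapped into the English vocabulary (hypothetical
    preprocessing). A candidate is considered only if it was published within
    max_days of the English document, and a pair is kept only if its
    similarity reaches min_sim. Both thresholds are illustrative values.
    """
    pairs = []
    for en_id, en_date, en_terms in english_docs:
        en_vec = Counter(en_terms)
        best_id, best_sim = None, 0.0
        for fa_id, fa_date, fa_terms in persian_docs:
            # Date filter: skip candidates published too far apart.
            if abs((en_date - fa_date).days) > max_days:
                continue
            sim = cosine(en_vec, Counter(fa_terms))
            if sim > best_sim:
                best_id, best_sim = fa_id, sim
        if best_id is not None and best_sim >= min_sim:
            pairs.append((en_id, best_id, best_sim))
    return pairs


# Toy data, invented for illustration only.
english = [
    ("bbc-1", date(2005, 3, 1), ["election", "vote", "parliament"]),
    ("bbc-2", date(2005, 3, 1), ["football", "match", "goal"]),
]
persian = [
    ("ham-1", date(2005, 3, 2), ["election", "parliament", "candidate"]),
    ("ham-2", date(2005, 3, 9), ["football", "goal", "match"]),
]

for en_id, fa_id, sim in align(english, persian):
    print(en_id, fa_id, round(sim, 3))
```

With the default one-day window, only `bbc-1` and `ham-1` are paired in this toy data: `ham-2` matches `bbc-2` on topic but falls outside the date window. Widening `max_days` trades precision for coverage, which is one of the kinds of alternatives a corpus builder might compare.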