Topic based creation of a Persian-English comparable corpus

  • Authors:
  • Zahra Rahimi;Azadeh Shakery

  • Affiliations:
  • School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran;School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran

  • Venue:
  • AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
  • Year:
  • 2011

Abstract

One of the most important issues in cross-language information retrieval (CLIR) is where to obtain translation knowledge. Multilingual corpora are valuable resources for this purpose, but few studies have addressed the construction of multilingual corpora for the Persian language. In this study, we propose a method for constructing a Persian-English comparable corpus from two independent news collections, based on date and topic criteria. Unlike most existing methods, which use publication dates as the main basis for aligning documents, we also consider date-independent alignments: alignments based only on topic and concept similarity. To avoid low-quality alignments, we cluster the collections by topic prior to alignment, which allows us to align similar documents whose publication dates are far apart. Evaluation results show the high quality of the constructed corpus and the possibility of extracting high-quality association knowledge from it for the task of CLIR.
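
The alignment idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the similarity measure, the clustering method, or any thresholds, so the cosine measure, the `Doc` fields, the `align` function, and the values of `max_days`, `near_threshold`, and `far_threshold` are all assumptions made for illustration. The sketch also assumes that each document already carries a topic-cluster label and that Persian and English term weights have been mapped into a shared vocabulary.

```python
from dataclasses import dataclass
from datetime import date
from math import sqrt


@dataclass
class Doc:
    doc_id: str
    published: date
    topic: int   # hypothetical topic-cluster label, assigned beforehand
    terms: dict  # term -> weight, assumed to be in a shared vocabulary


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(persian_docs, english_docs,
          max_days=3, near_threshold=0.3, far_threshold=0.6):
    """Return (persian_id, english_id, score) triples for aligned pairs.

    All threshold values are illustrative assumptions.
    """
    pairs = []
    for p in persian_docs:
        for e in english_docs:
            score = cosine(p.terms, e.terms)
            days_apart = abs((p.published - e.published).days)
            if days_apart <= max_days:
                # Date-based alignment: close dates, moderate similarity.
                accepted = score >= near_threshold
            else:
                # Date-independent alignment: require the same topic
                # cluster and a stricter similarity threshold.
                accepted = p.topic == e.topic and score >= far_threshold
            if accepted:
                pairs.append((p.doc_id, e.doc_id, score))
    return pairs
```

In this sketch the topic check guards only the date-distant pairs, reflecting the abstract's point that clustering by topic is what makes such alignments safe; a real system would also avoid the exhaustive pairwise loop, for example by comparing documents only within the same cluster.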