Extracting parallel paragraphs and sentences from English-Persian translated documents

  • Authors:
  • Mohammad Sadegh Rasooli;Omid Kashefi;Behrouz Minaei-Bidgoli

  • Affiliations:
  • Department of Computer Engineering, Iran University of Science and Technology, Iran;Department of Computer Engineering, Iran University of Science and Technology, Iran;Department of Computer Engineering, Iran University of Science and Technology, Iran

  • Venue:
  • AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
  • Year:
  • 2011

Abstract

Sentence and paragraph alignment is essential for preparing the parallel texts needed in applications such as machine translation. For under-resourced languages like Persian, the lack of sufficient linguistic data makes this task challenging. In this paper, we propose a hybrid sentence and paragraph alignment model for Persian-English parallel documents, based on simple linguistic features as well as length similarity between the sentences and paragraphs of the source and target languages. As alignment features we use a small bilingual dictionary of Persian-English nouns, punctuation marks, and length similarity. We combine these features in a linear model and use a genetic algorithm to learn the weights of the linear equation. Evaluation results show that the added features improve on the baseline model, which relies on length alone.
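
The abstract does not give the feature definitions or the genetic algorithm settings, so the following is only a minimal sketch of the general idea: three features combined linearly, with weights searched by a simple genetic algorithm. Everything specific here is an assumption made for illustration and not the authors' method: the function names (`length_sim`, `punct_sim`, `dict_sim`, `learn_weights`), the ratio-based feature formulas, the use of alignment accuracy as fitness, and the selection, crossover, and mutation operators. The two sentence pairs and the two-entry lexicon are toy data.

```python
import random

# Assumed mapping of Persian and Latin punctuation marks to a shared symbol.
PUNCT = {".": ".", ",": ",", "،": ",", ";": ";", "؛": ";",
         "?": "?", "؟": "?", "!": "!", ":": ":"}

def tokens(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    return "".join(" " if c in PUNCT else c for c in text.lower()).split()

def length_sim(src, tgt):
    """Ratio of the shorter to the longer character length, in [0, 1]."""
    longer = max(len(src), len(tgt))
    return min(len(src), len(tgt)) / longer if longer else 1.0

def punct_sim(src, tgt):
    """Share of punctuation marks (after mapping) common to both sides."""
    a = [PUNCT[c] for c in src if c in PUNCT]
    b = [PUNCT[c] for c in tgt if c in PUNCT]
    if not a and not b:
        return 1.0
    common = sum(min(a.count(m), b.count(m)) for m in set(a))
    return common / max(len(a), len(b))

def dict_sim(src, tgt, lexicon):
    """Fraction of known source nouns whose translation occurs in the target."""
    tgt_words = set(tokens(tgt))
    known = [w for w in tokens(src) if w in lexicon]
    if not known:
        return 0.0
    return sum(lexicon[w] in tgt_words for w in known) / len(known)

def score(weights, src, tgt, lexicon):
    """Linear combination of the three features."""
    feats = (length_sim(src, tgt), punct_sim(src, tgt), dict_sim(src, tgt, lexicon))
    return sum(w * f for w, f in zip(weights, feats))

def accuracy(weights, pairs, lexicon):
    """Fraction of source units whose highest-scoring target is the gold one."""
    correct = 0
    for i, (src, _) in enumerate(pairs):
        best = max(range(len(pairs)),
                   key=lambda j: score(weights, src, pairs[j][1], lexicon))
        correct += best == i
    return correct / len(pairs)

def learn_weights(pairs, lexicon, generations=30, pop_size=20, seed=0):
    """Toy genetic algorithm: truncation selection, uniform crossover, Gaussian mutation."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(3)] for _ in range(pop_size)]
    fitness = lambda w: accuracy(w, pairs, lexicon)
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(genes) for genes in zip(a, b)]
            i = rng.randrange(3)                # mutate one weight, clamped to [0, 1]
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# Toy data: Persian-to-English noun lexicon and gold-aligned sentence pairs.
lexicon = {"کتاب": "book", "مدرسه": "school"}
pairs = [("این کتاب است.", "This is a book."),
         ("او به مدرسه رفت، اما دیر رسید.", "He went to school, but arrived late.")]
weights = learn_weights(pairs, lexicon)
print(weights, accuracy(weights, pairs, lexicon))
```

In this sketch, setting the weights to `[1, 0, 0]` reduces the scorer to a purely length-based one, which corresponds in spirit to the length-only baseline mentioned in the abstract. A realistic setup would evaluate fitness on held-out aligned documents rather than on the training pairs.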