TEP: Tehran English-Persian parallel corpus

  • Authors:
  • Mohammad Taher Pilevar; Heshaam Faili; Abdol Hamid Pilevar

  • Affiliations:
  • Natural Language Processing Laboratory, University of Tehran, Iran; Natural Language Processing Laboratory, University of Tehran, Iran; Faculty of Computer Engineering, Bu Ali Sina University, Hamedan, Iran

  • Venue:
  • CICLing'11: Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing - Volume Part II
  • Year:
  • 2011

Abstract

Parallel corpora are among the key resources in natural language processing. Despite their importance in many multilingual applications, no large-scale English-Persian corpus has been made available so far, owing to the difficulty of creating one and the intensive labor required. In this paper, we describe the construction of the Tehran English-Persian parallel corpus (TEP) from movie subtitles, together with some of the difficulties we encountered during data extraction and sentence alignment. To the best of our knowledge, TEP is the first freely released large-scale (on the order of a million words) English-Persian parallel corpus.