TEP: Tehran English-Persian parallel corpus

Authors:
Mohammad Taher Pilevar; Heshaam Faili; Abdol Hamid Pilevar
Affiliations:
Natural Language Processing Laboratory, University of Tehran, Iran; Natural Language Processing Laboratory, University of Tehran, Iran; Faculty of Computer Engineering, Bu Ali Sina University, Hamedan, Iran
Venue:
CICLing'11 Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part II
Year:
2011

Abstract

Parallel corpora are one of the key resources in natural language processing. Despite their importance in many multilingual applications, no large-scale English-Persian corpus has been made available so far, owing to the difficulty and labor-intensive nature of its creation. In this paper, we describe the construction of the Tehran English-Persian parallel corpus (TEP) from movie subtitles, together with some of the difficulties we encountered during data extraction and sentence alignment. To the best of our knowledge, TEP is the first freely released large-scale (on the order of a million words) English-Persian parallel corpus.