Creating a Persian-English comparable corpus

Authors:
Homa Baradaran Hashemi;Azadeh Shakery;Heshaam Faili
Affiliations:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran;School of Electrical and Computer Engineering, College of Engineering, University of Tehran;School of Electrical and Computer Engineering, College of Engineering, University of Tehran
Venue:
CLEF'10 Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum
Year:
2010

Citing 14
Cited 2

Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval

SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval
Experiments in multilingual information retrieval using the SPIDER system

SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval
Document language models, query models, and risk minimization for information retrieval

Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval
Comparing cross-language query expansion techniques by degrading translation resources

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Multilingual Information Retrieval Based on Document Alignment Techniques

ECDL '98 Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries
Semi-automatic Compilation of Bilingual Lexicon Entries from Cross-Lingually Relevant News Articles on WWW News Sites

AMTA '02 Proceedings of the 5th Conference of the Association for Machine Translation in the Americas on Machine Translation: From Research to Real Users
Multext-East: parallel and comparable corpora and lexicons for six Central and Eastern European languages

COLING '98 Proceedings of the 17th international conference on Computational linguistics - Volume 1
Mining the Web for bilingual text

ACL '99 Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics
Building parallel corpora by automatic title alignment using length-based and text-based approaches

Information Processing and Management: an International Journal
Mining comparable bilingual text corpora for cross-language information integration

Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Improving Machine Translation Performance by Exploiting Non-Parallel Corpora

Computational Linguistics
Creating and exploiting a comparable corpus in cross-language information retrieval

ACM Transactions on Information Systems (TOIS)
Focused web crawling in the acquisition of comparable corpora

Information Retrieval
Hamshahri: A standard Persian text collection

Knowledge-Based Systems

Topic based creation of a persian-english comparable corpus

AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
Mining a Persian-English comparable corpus for cross-language information retrieval

Information Processing and Management: an International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

Multilingual corpora are valuable resources for cross-language information retrieval and are available in many language pairs. However the Persian language does not have rich multilingual resources due to some of its special features and difficulties in constructing the corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in English and Hamshahri news in Persian. We use the similarity of the document topics and their publication dates to align the documents in these sets. We tried several alternatives for constructing the comparable corpora and assessed the quality of the corpora using different criteria. Evaluation results show the high quality of the aligned documents and using the Persian-English comparable corpus for extracting translation knowledge seems promising.