Topic based creation of a Persian-English comparable corpus

Authors:
Zahra Rahimi;Azadeh Shakery
Affiliations:
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran;School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
Venue:
AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
Year:
2011

Abstract

One of the most important issues in cross-language information retrieval (CLIR) is where to obtain translation knowledge. Multilingual corpora are valuable resources for this purpose, but few studies have addressed the construction of multilingual corpora for the Persian language. In this study, we propose a method for constructing a Persian-English comparable corpus from two independent news collections, based on date and topic criteria. Unlike most existing methods, which use publication dates as the main basis for aligning documents, we also consider date-independent alignments: alignments based only on topic and concept similarity. To avoid low-quality alignments, we cluster the collections by topic prior to alignment, which allows us to align similar documents whose publication dates are far apart. Evaluation results show the high quality of the constructed corpus and the possibility of extracting high-quality association knowledge from it for the task of CLIR.
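
The alignment idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the similarity measure, the clustering algorithm, or any thresholds. The sketch assumes that each document already carries a topic label from a prior clustering step and that its terms have already been mapped into a shared vocabulary (for example through a dictionary). The field names `id`, `date`, `topic`, and `terms`, the cosine measure, and the parameters `near_days`, `near_thresh`, and `far_thresh` are all hypothetical choices made for illustration.

```python
from collections import Counter
from datetime import date
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def align(source_docs, target_docs, near_days=3, near_thresh=0.3, far_thresh=0.6):
    """Return (source_id, target_id, score) alignments.

    Assumptions (not from the paper): documents are dicts with the
    hypothetical keys id, date, topic, terms; 'terms' are already in a
    shared vocabulary; 'topic' is a cluster label computed beforehand.
    """
    pairs = []
    for s in source_docs:
        s_vec = Counter(s["terms"])
        for t in target_docs:
            score = cosine(s_vec, Counter(t["terms"]))
            gap = abs((s["date"] - t["date"]).days)
            if gap <= near_days:
                # Date-based alignment: close dates, moderate similarity.
                ok = score >= near_thresh
            else:
                # Date-independent alignment: same topic cluster and a
                # stricter similarity threshold to avoid weak matches.
                ok = s["topic"] == t["topic"] and score >= far_thresh
            if ok:
                pairs.append((s["id"], t["id"], round(score, 3)))
    return pairs


# Toy data, invented purely for illustration.
persian = [
    {"id": "fa1", "date": date(2004, 5, 1), "topic": "sports",
     "terms": ["football", "match", "team", "goal"]},
]
english = [
    {"id": "en1", "date": date(2004, 5, 2), "topic": "sports",
     "terms": ["football", "team", "coach"]},
    {"id": "en2", "date": date(2004, 9, 10), "topic": "sports",
     "terms": ["football", "match", "team", "goal", "league"]},
    {"id": "en3", "date": date(2004, 9, 10), "topic": "politics",
     "terms": ["election", "vote"]},
]

alignments = align(persian, english)
```

In this toy run, `fa1` is paired with `en1` through the date-based branch and with `en2` through the date-independent branch, while `en3` is rejected because it belongs to a different topic cluster and shares no terms. Restricting distant-date comparisons to the same cluster is what keeps the date-independent branch from producing low-quality pairs.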