Mining a Persian-English comparable corpus for cross-language information retrieval
Information Processing and Management: an International Journal
One of the central issues in cross-language information retrieval (CLIR) is where to obtain translation knowledge. Multilingual corpora are valuable resources for this purpose, but few studies have addressed the construction of multilingual corpora for the Persian language. In this study, we propose a method for constructing a Persian-English comparable corpus from two independent news collections, using date and topic criteria. Unlike most existing methods, which use publication dates as the main basis for aligning documents, we also consider date-independent alignments: alignments based only on topic and concept similarity. To avoid low-quality alignments, we cluster the collections by topic before alignment, which allows us to align similar documents whose publication dates are far apart. Evaluation results show that the constructed corpus is of high quality and that high-quality association knowledge can be extracted from it for the CLIR task.
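The alignment idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the clustering algorithm, or any thresholds, so everything here is an assumption made for illustration. The `Doc` type, the `align` function, cosine similarity over bags of words, the date window, and both threshold values are hypothetical. The sketch also assumes that topic cluster labels were computed beforehand and that both collections are already mapped into a shared vocabulary (for example, through a bilingual dictionary).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from math import sqrt


@dataclass
class Doc:
    doc_id: str
    published: date
    cluster: int     # topic cluster label (assumed to be computed beforehand)
    terms: Counter   # bag of words in a shared vocabulary (assumed pre-translated)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def align(source_docs, target_docs, window_days=1,
          near_threshold=0.3, far_threshold=0.6):
    """Return (source_id, target_id, similarity) for every accepted pair.

    Hypothetical acceptance rule:
      * date-based: dates within `window_days` and similarity >= near_threshold
      * date-independent: any date gap, but the two documents must share a
        topic cluster and reach the stricter far_threshold
    """
    pairs = []
    for s in source_docs:
        for t in target_docs:
            sim = cosine(s.terms, t.terms)
            gap = abs((s.published - t.published).days)
            if gap <= window_days:
                if sim >= near_threshold:
                    pairs.append((s.doc_id, t.doc_id, sim))
            elif s.cluster == t.cluster and sim >= far_threshold:
                pairs.append((s.doc_id, t.doc_id, sim))
    return pairs
```

In this sketch, restricting date-independent pairs to the same topic cluster plays the role the abstract assigns to clustering: it filters out spurious matches between documents that happen to share vocabulary but were published far apart and belong to different topics.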