Creating and exploiting a comparable corpus in cross-language information retrieval

Authors:
Tuomas Talvensaari;Jorma Laurikkala;Kalervo Järvelin;Martti Juhola;Heikki Keskustalo
Affiliations:
University of Tampere, Finland;University of Tampere, Finland;University of Tampere, Finland;University of Tampere, Finland;University of Tampere, Finland
Venue:
ACM Transactions on Information Systems (TOIS)
Year:
2007

Abstract

We present a method for creating a comparable text corpus from two document collections in different languages. The collections can be very different in origin. In this study, we build a comparable corpus from articles by a Swedish news agency and a U.S. newspaper. The keys with the best resolution power were extracted from the documents of one collection, the source collection, using the relative average term frequency (RATF) value. The keys were translated into the language of the other collection, the target collection, with a dictionary-based query translation program. The translated queries were run against the target collection, and a retrieved document was aligned with its source document if it met the given date and similarity score criteria. The resulting comparable collection was used as a similarity thesaurus to translate queries, alongside a dictionary-based translator. The combined approach outperformed translation schemes in which dictionary-based translation or corpus translation was used alone.
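
The key-selection and alignment steps described in the abstract can be illustrated with a minimal sketch. The abstract gives neither the RATF formula nor the thresholds, so everything below is an assumption: the RATF form follows the definition commonly given in the search-key resolution-power literature, (cf/df) * 1000 / ln(df + SP)^p, and the values of `sp`, `p`, `max_days`, and `min_score`, as well as all function names and the toy data, are hypothetical. The dictionary-based translation and the retrieval run are omitted; `hits` stands in for their output.

```python
import math
from collections import Counter
from datetime import date


def ratf(cf, df, sp=10, p=3):
    """RATF score from collection frequency cf and document frequency df.

    sp and p are tuning parameters; the defaults are placeholders for toy data.
    """
    return (cf / df) * 1000 / math.log(df + sp) ** p


def collection_stats(docs):
    """Return (cf, df) counters for a dict mapping doc_id -> token list."""
    cf, df = Counter(), Counter()
    for tokens in docs.values():
        cf.update(tokens)        # total occurrences of each token
        df.update(set(tokens))   # number of documents containing each token
    return cf, df


def top_keys(tokens, cf, df, n=5):
    """Pick the n distinct tokens of one document with the highest RATF."""
    return sorted(set(tokens), key=lambda t: (-ratf(cf[t], df[t]), t))[:n]


def align(source_date, hits, max_days=1, min_score=0.5):
    """Keep target hits that meet both the date and the score criterion.

    hits is a list of (doc_id, publication_date, similarity_score) tuples.
    """
    return [
        doc_id
        for doc_id, pub_date, score in hits
        if abs((pub_date - source_date).days) <= max_days and score >= min_score
    ]


# Toy source collection (hypothetical data).
source_docs = {
    "s1": ["election", "the", "election", "parliament"],
    "s2": ["the", "football", "match"],
    "s3": ["the", "weather"],
}
cf, df = collection_stats(source_docs)
keys = top_keys(source_docs["s1"], cf, df, n=2)

# Stand-in for the result of running the translated keys against the target collection.
hits = [
    ("t1", date(1994, 5, 3), 0.8),   # close in time, high score
    ("t2", date(1994, 5, 20), 0.9),  # high score, but too far apart in time
    ("t3", date(1994, 5, 2), 0.2),   # same day, but score too low
]
pairs = align(date(1994, 5, 2), hits)
```

In this toy run, `keys` holds the two highest-scoring keys of document `s1`, and `pairs` keeps only the hits that pass both filters. Requiring both a date window and a score threshold mirrors the abstract's two alignment criteria; how strict each should be is a tuning question the abstract does not settle.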