Resource selection for domain-specific cross-lingual IR

  • Authors:
  • Monica Rogati; Yiming Yang

  • Affiliations:
  • Carnegie Mellon University, Pittsburgh, PA (both authors)

  • Venue:
  • Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

  • Year:
  • 2004

Abstract

An under-explored question in cross-language information retrieval (CLIR) is to what degree the performance of CLIR methods depends on the availability of high-quality translation resources for particular domains. To address this issue, we evaluate several competitive CLIR methods, each trained on different corpora, on test documents in the medical domain. Our results show severe performance degradation when using a general-purpose training corpus or a commercial machine translation system (SYSTRAN) instead of a domain-specific training corpus. A related unexplored question is whether we can improve CLIR performance by systematically analyzing training resources and optimally matching them to target collections. We begin exploring this problem by suggesting a simple criterion for automatically matching training resources to target corpora. Using the cosine similarity between training and target corpora as resource weights, we obtained an average improvement of 5.6% over using all resources with no weights. The same metric yields 99.4% of the performance obtained when an oracle chooses the optimal resource every time.
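
The resource-weighting criterion described in the abstract is straightforward to prototype. The Python sketch below computes a term-frequency vector for each candidate training resource and for the target collection, then uses their cosine similarity as the resource weight. The corpus names, whitespace tokenization, and plain term-frequency representation are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: weight each training resource by the cosine similarity
# between its term-frequency vector and that of the target collection.
# Resource names and preprocessing are assumptions for illustration only.
import math
from collections import Counter

def term_vector(texts):
    """Bag-of-words term-frequency vector for a corpus (list of documents)."""
    counts = Counter()
    for doc in texts:
        counts.update(doc.lower().split())
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def resource_weights(training_corpora, target_corpus):
    """Map each training resource to its cosine similarity with the target."""
    target_vec = term_vector(target_corpus)
    return {name: cosine(term_vector(docs), target_vec)
            for name, docs in training_corpora.items()}

if __name__ == "__main__":
    # Toy corpora standing in for parallel training resources and the
    # (medical-domain) target collection.
    resources = {
        "general_news": ["the parliament discussed the budget today"],
        "medical_parallel": ["the patient received a dose of antibiotics",
                             "clinical trial results for the new drug"],
    }
    target = ["antibiotic treatment of the patient in the clinical study"]
    print(resource_weights(resources, target))
```

In this toy example the medical resource receives a higher weight than the general-news one, which mirrors the abstract's finding that matching training resources to the target domain matters; the actual experiments combine weighted resources inside full CLIR models rather than ranking toy corpora.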