Empirical studies on the impact of lexical resources on CLIR performance

Authors:
Jinxi Xu; Ralph Weischedel
Affiliations:
BBN Technologies, 10 Moulton Street, Cambridge, MA; BBN Technologies, 10 Moulton Street, Cambridge, MA
Venue:
Information Processing and Management: an International Journal - Special issue: Cross-language information retrieval
Year:
2005

Abstract

In this paper, we compile and review several experiments measuring cross-lingual information retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora over Chinese, Spanish and Arabic. Our findings include:

• Acceptable CLIR performance can be achieved using only a bilingual term list (70-80% on Chinese and Arabic corpora).
• However, if both a bilingual term list and parallel corpora are available, CLIR performance can rival monolingual performance.
• If no parallel corpus is available, pseudo-parallel texts produced by an MT system can partially compensate for the lack of parallel text.
• Stemming is normally useful, but with a very large Arabic-English parallel corpus it hurt performance in our empirical studies with Arabic, a highly inflected language.
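The abstract does not spell out the probabilistic language model. As an illustration only, the sketch below assumes a common query-likelihood form for CLIR, in which each query term is generated either from a general background model or by translating a term drawn from the document: P(q|D) = α·P(q|background) + (1 − α)·Σ_d P(d|D)·P(q|d). The function name, the value of `alpha`, and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def clir_log_score(query_terms, doc_terms, trans_prob, background, alpha=0.3):
    """Log-likelihood of a query in one language given a document in another.

    Assumed form (not confirmed by the abstract):
        P(q|D) = alpha * P(q|background) + (1 - alpha) * sum_d P(d|D) * P(q|d)

    trans_prob[d][q] is a hypothetical translation probability P(q|d), e.g.
    estimated from a bilingual term list or a parallel corpus.
    """
    counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for q in query_terms:
        # Probability of reaching q by translating some term of the document.
        translated = sum(
            (n / doc_len) * trans_prob.get(d, {}).get(q, 0.0)
            for d, n in counts.items()
        )
        # Smooth with the background model so unseen terms do not zero the score.
        p = alpha * background.get(q, 1e-9) + (1 - alpha) * translated
        score += math.log(p)
    return score


# Toy data: an English query scored against two Spanish documents.
trans_prob = {"banco": {"bank": 0.8}, "rio": {"river": 0.9}}
background = {"bank": 0.01, "river": 0.01}
score_relevant = clir_log_score(["bank"], ["banco", "banco"], trans_prob, background)
score_other = clir_log_score(["bank"], ["rio", "rio"], trans_prob, background)
```

Under this assumed form, the lexical resources studied in the paper would enter only through `trans_prob`, which is why the coverage and quality of a term list or parallel corpus would directly affect the ranking.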