Rich results from poor resources: NTCIR-4 monolingual and cross-lingual retrieval of Korean texts using Chinese and English

Authors:
Kui Lam Kwok;Sora Choi;Norbert Dinstl
Affiliations:
Queens College, City University of New York, Flushing, NY;Queens College, City University of New York, Flushing, NY;Queens College, City University of New York, Flushing, NY
Venue:
ACM Transactions on Asian Language Information Processing (TALIP)
Year:
2005

Abstract

We report on Korean monolingual, Chinese-Korean bilingual (with English as a pivot), and Chinese-English bilingual CLIR experiments in the NTCIR-4 environment, using MT software augmented with Web-based entity-oriented translation as resources. Simple stemming helps improve bigram indexing for Korean retrieval. For word indexing, keeping nouns only is preferable. Web-based translation reduces the number of untranslated terms left over after MT and substantially improves CLIR results. Translation concatenation consistently improves CLIR effectiveness, and combining retrieval lists from bigram and word indexing is also helpful. A method to disambiguate multiple MT outputs using a log likelihood ratio threshold was tested. Depending on the query type (title or description), the indexing used (bigram only or a retrieval combination), and the evaluation (relaxed or rigid), direct bilingual CLIR achieved an average precision of 71–79% (English-Korean) and 76–84% (Chinese-English) of the corresponding Korean-Korean and English-English monolingual results. Using English as a pivot in Chinese-Korean CLIR provides about 55–65% of the effectiveness of Korean monolingual retrieval. Entity/terminology translation at the pivot language stage accounts for a large portion of this deficiency. A comparatively poor Chinese-English bilingual result for a topic does not necessarily mean that the topic will continue to under-perform at the Korean retrieval level after the further transitive translation into Korean.
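As a rough illustration of two ideas named in the abstract, character bigram indexing and the combination of a bigram retrieval list with a word retrieval list, the following minimal sketch may help. It is not the authors' implementation: the function names, the min-max normalization, the linear weighting, and the `alpha` parameter are all assumptions made for this example, since the abstract does not specify how the lists were combined.

```python
from collections import defaultdict


def char_bigrams(text):
    """Return overlapping character bigrams for each whitespace-separated token."""
    grams = []
    for token in text.split():
        if len(token) == 1:
            grams.append(token)  # keep single-character tokens as they are
        else:
            grams.extend(token[i:i + 2] for i in range(len(token) - 1))
    return grams


def min_max_normalize(scores):
    """Scale a {doc_id: score} mapping to the range [0, 1] (assumed normalization)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def combine_lists(bigram_scores, word_scores, alpha=0.5):
    """Linearly combine two retrieval lists; alpha is a hypothetical weight."""
    a = min_max_normalize(bigram_scores)
    b = min_max_normalize(word_scores)
    fused = defaultdict(float)
    for doc, s in a.items():
        fused[doc] += alpha * s
    for doc, s in b.items():
        fused[doc] += (1 - alpha) * s
    # Highest combined score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```

For instance, `char_bigrams("한국어 검색")` yields the bigrams of each token separately, and `combine_lists` can be called with two score dictionaries keyed by document identifier. In practice the weight would need tuning on held-out topics.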