Exploring new languages with HAIRCUT at CLEF 2005

Authors:
Paul McNamee
Affiliations:
The Johns Hopkins University Applied Physics Laboratory, Laurel, MD
Venue:
CLEF'05 Proceedings of the 6th international conference on Cross-Language Evalution Forum: accessing Multilingual Information Repositories
Year:
2005

Citing 7
Cited 3

A language modeling approach to information retrieval

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Dictionary-Based Cross-Language Information Retrieval: Problems, Methods, and Research Findings

Information Retrieval
Single n-gram stemming

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Character N-Gram Tokenization for European Language Text Retrieval

Information Retrieval
Char_align: a program for aligning parallel texts at the character level

ACL '93 Proceedings of the 31st annual meeting on Association for Computational Linguistics
Translating pieces of words

Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Letter level learning for language independent diacritics restoration

COLING-02 proceedings of the 6th conference on Natural language learning - Volume 20

Benefits of resource-based stemming in hungarian information retrieval

CLEF'06 Proceedings of the 7th international conference on Cross-Language Evaluation Forum: evaluation of multilingual and multi-modal information retrieval
JHU/APL ad hoc experiments at CLEF 2006

CLEF'06 Proceedings of the 7th international conference on Cross-Language Evaluation Forum: evaluation of multilingual and multi-modal information retrieval
Cross-lingual random indexing for information retrieval

SLSP'13 Proceedings of the First international conference on Statistical Language and Speech Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

JHU/APL has long espoused the use of language-neutral methods for cross-language information retrieval. This year we participated in the ad hoc cross-language track and submitted both monolingual and bilingual runs. We undertook our first investigations in the Bulgarian and Hungarian languages. In our bilingual experiments we used several non-traditional CLEF query languages such as Greek, Hungarian, and Indonesian, in addition to several western European languages. We found that character n-grams remain an attractive option for representing documents and queries in these new languages. In our monolingual tests n-grams were more effective than unnormalized words for retrieval in Bulgarian (+30%) and Hungarian (+63%). Our bilingual runs made use of subword translation, statistical translation of character n-grams using aligned corpora, when parallel data were available, and web-based machine translation, when no suitable data could be found.