Character N-Gram Tokenization for European Language Text Retrieval
Information Retrieval
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
Exploring new languages with HAIRCUT at CLEF 2005
CLEF'05 Proceedings of the 6th international conference on Cross-Language Evaluation Forum: Accessing Multilingual Information Repositories
JHU/APL continued its participation in multilingual retrieval at CLEF in 2006. We again applied our hallmark technique for combating language diversity and morphological complexity: character n-gram tokenization. This year we participated in the ad hoc cross-language track and submitted both monolingual and bilingual runs. Our experimental results agree with our previous reports that n-grams perform especially well in linguistically complex languages, notably Bulgarian and Hungarian, where monolingual improvements of 27% and 70%, respectively, were observed compared to space-delimited word forms. As in CLEF 2005, our bilingual submissions made use of subword translation (statistical translation of character n-grams using aligned corpora) when parallel data were available, and web-based machine translation when no suitable data were available to us.
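To make the tokenization scheme concrete, the sketch below shows overlapping character n-grams over space-delimited text. This is an illustration only: the n-gram length (here 5) and the underscore boundary marker are assumptions for the example, not details given in the abstract.

```python
def char_ngrams(text, n=5):
    """Split text into overlapping character n-grams.

    Spaces are replaced by an underscore boundary marker and the string
    is padded at both ends, so n-grams can capture word boundaries.
    The marker and n=5 are illustrative choices, not the paper's exact
    configuration.
    """
    padded = "_" + text.strip().lower().replace(" ", "_") + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# An inflected word shares most of its n-grams with other forms of the
# same stem, which is why n-grams help in morphologically rich languages.
print(char_ngrams("juggling"))
```

Because any two surface forms of the same stem (e.g., "juggle", "juggling") share most of their interior n-grams, matching at the n-gram level approximates stemming without language-specific morphological analysis.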