Data driven methods for improving mono- and cross-lingual IR performance in noisy environments

Authors:
Antti Järvelin;Tuomas Talvensaari;Anni Järvelin
Affiliations:
University of Tampere, Finland;University of Tampere, Finland;University of Tampere, Finland
Venue:
Proceedings of the second workshop on Analytics for noisy unstructured text data
Year:
2008


Abstract

In cross-language information retrieval (CLIR), novel or non-standard expressions, technical terminology, and rare proper nouns can be seen as noise when they appear in queries or in the target collection. Such vocabulary is often out-of-vocabulary (OOV) for the dictionaries used to translate queries. In historic document retrieval (HDR), OCR errors and historical spelling variants cause similar problems. In this paper, three data-driven approaches to these problems are presented. The first two methods, the transformation rule based translation (TRT) method and the classified s-gram method, operate at the string level. With them, approximate matches of a query word can be recognized in the target document collection and included in the target query. In the third method, the corpus-based approach, parallel or comparable corpora are employed to derive translation knowledge that can be used to translate OOV words. In addition to an overview of the methods, three case studies highlighting their practical application in CLIR are presented. The methods are shown to be effective for dictionary-free query translation between closely related languages (TRT and s-grams), for OOV word translation (s-grams), and for boosting dictionary-based CLIR performance through OOV word translation (corpus-based methods).
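
The string-level matching idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`s_grams`, `s_gram_similarity`, `best_matches`), the grouping of skip distances into the classes `(0,)` and `(1, 2)`, and the use of Jaccard similarity are all assumptions made for illustration. The sketch forms character pairs that are a given number of characters apart, tags each pair with its skip class, and ranks index terms by the overlap of their tagged pairs with those of the query word.

```python
def s_grams(word, skip):
    """Return the set of character pairs in `word` with `skip` characters between them."""
    return {word[i] + word[i + skip + 1] for i in range(len(word) - skip - 1)}


def s_gram_similarity(a, b, skip_classes=((0,), (1, 2))):
    """Jaccard similarity over s-grams tagged with their skip class.

    `skip_classes` is an assumed grouping of skip distances; pairs are only
    compared with pairs from the same class.
    """
    def profile(word):
        grams = set()
        for class_id, skips in enumerate(skip_classes):
            for skip in skips:
                grams |= {(class_id, g) for g in s_grams(word, skip)}
        return grams

    pa, pb = profile(a.lower()), profile(b.lower())
    union = pa | pb
    # Words too short to yield any pair get similarity 0.0.
    return len(pa & pb) / len(union) if union else 0.0


def best_matches(query_word, index_terms, top_k=3):
    """Rank target index terms by s-gram similarity to the query word (hypothetical helper)."""
    ranked = sorted(
        index_terms,
        key=lambda term: s_gram_similarity(query_word, term),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    # A Finnish query word matched against a toy English index.
    print(best_matches("farmakologia", ["history", "biology", "pharmacology"], top_k=1))
```

In this sketch the top-ranked terms would be added to the target query in place of a dictionary translation, which is the role the abstract assigns to approximate matches.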