Data driven methods for improving mono- and cross-lingual IR performance in noisy environments

  • Authors:
  • Antti Järvelin;Tuomas Talvensaari;Anni Järvelin

  • Affiliations:
  • University of Tampere, Finland;University of Tampere, Finland;University of Tampere, Finland

  • Venue:
  • Proceedings of the second workshop on Analytics for noisy unstructured text data
  • Year:
  • 2008

Abstract

In cross-language information retrieval (CLIR), novel or non-standard expressions, technical terminology, and rare proper nouns act as noise when they appear in queries or in the target collection. Such words are often out-of-vocabulary (OOV) for the dictionaries used to translate queries. In historic document retrieval (HDR), OCR errors and historical spelling variants cause similar problems. This paper presents three data-driven approaches to these problems. The first two methods, transformation rule based translation (TRT) and the classified s-gram method, operate at the string level: they recognize approximate matches of a query word in the target document collection and include them in the target query. The third, corpus-based approach employs parallel or comparable corpora to derive translation knowledge that can be used to translate OOV words. In addition to an overview of the methods, three case studies highlighting their practical application in CLIR are presented. The methods are shown to be effective in query translation without dictionaries between closely related languages (TRT and s-grams), in OOV word translation (s-grams), and in boosting dictionary-based CLIR performance by way of OOV word translation (corpus-based methods).
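
The string-level matching idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the skip classes `((0,), (1, 2))`, the Jaccard overlap, the averaging across classes, and the function names `s_grams`, `classified_similarity`, and `best_matches` are all assumptions chosen for illustration, and refinements such as word-boundary padding are left out.

```python
def s_grams(word, skip):
    """Return the set of character pairs in `word` that have `skip` characters between them."""
    return {word[i] + word[i + skip + 1] for i in range(len(word) - skip - 1)}


def classified_similarity(a, b, classes=((0,), (1, 2))):
    """Average Jaccard overlap of s-gram sets, computed separately per skip class.

    The default classes are an assumption, not a setting taken from the paper.
    """
    scores = []
    for cls in classes:
        # Pool the s-grams of all skip lengths that belong to the same class.
        grams_a = set().union(*(s_grams(a, k) for k in cls))
        grams_b = set().union(*(s_grams(b, k) for k in cls))
        union = grams_a | grams_b
        if union:  # skip classes for which neither word yields any s-gram
            scores.append(len(grams_a & grams_b) / len(union))
    return sum(scores) / len(scores) if scores else 0.0


def best_matches(query_word, index_terms, top_k=3):
    """Rank target-collection index terms by similarity to the query word."""
    ranked = sorted(
        index_terms,
        key=lambda term: classified_similarity(query_word, term),
        reverse=True,
    )
    return ranked[:top_k]
```

Under these assumptions, `best_matches("pneumonia", ["pneumonie", "anemia", "harmonica"], top_k=1)` selects the near-identical spelling variant, which could then be added to the target query in place of a dictionary translation.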