MorphoSaurus in ImageCLEF 2006: the effect of subwords on biomedical IR

  • Authors:
  • Philipp Daumke; Jan Paetzold; Kornel Marko

  • Affiliations:
  • University Hospital Freiburg, Dept. of Medical Informatics, Freiburg, Germany;University Hospital Freiburg, Dept. of Medical Informatics, Freiburg, Germany;University Hospital Freiburg, Dept. of Medical Informatics, Freiburg, Germany

  • Venue:
  • CLEF'06 Proceedings of the 7th international conference on Cross-Language Evaluation Forum: evaluation of multilingual and multi-modal information retrieval
  • Year:
  • 2006

Abstract

In the 2006 ImageCLEF Medical Image Retrieval task, we evaluate the effects of deep morphological analysis on mono- and cross-lingual document retrieval in the biomedical domain. The morphological analysis is based on the MorphoSaurus system, in which subwords are introduced as morphologically meaningful word units. Subwords are organized in language-specific lexica, which were generated partly manually and partly automatically and currently cover six European languages. They are linked together in a multilingual thesaurus. Using subwords instead of full words significantly reduces the number of lexical entries needed to sufficiently cover a specific language and domain. A further benefit of the approach is its independence from the underlying retrieval system. We combined MorphoSaurus with the open-source search engine Lucene and achieved precision gains of up to 25% over the baseline in a monolingual setting, along with promising results in a multilingual scenario.
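
The idea of subword indexing described in the abstract can be illustrated with a minimal sketch. It is not the MorphoSaurus implementation: the tiny lexicon, the concept identifiers (such as `#stomach`), the function names, and the longest-match segmentation strategy are all assumptions made for illustration. The sketch only shows how mapping subwords from different languages onto shared identifiers could let an English and a German term produce the same index terms, whatever retrieval engine consumes them afterwards.

```python
# Hypothetical toy lexicon: subword -> language-independent concept identifier.
# All entries and identifiers are invented for this example; None marks a
# subword (linking vowel, suffix) that is assumed to carry no meaning of its own.
TOY_LEXICON = {
    "gastr": "#stomach",      # English/Latinate stem
    "magen": "#stomach",      # German stem
    "enter": "#intestine",
    "darm": "#intestine",
    "itis": "#inflamm",
    "entzuend": "#inflamm",
    "o": None,
    "ung": None,
}


def segment(word, lexicon):
    """Split a word into lexicon subwords, trying the longest prefix first.

    Backtracks if a choice leaves an unsegmentable remainder.
    Returns a list of subwords, or None if no full segmentation exists.
    """
    word = word.lower()
    if not word:
        return []
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None


def to_concepts(text, lexicon):
    """Replace each token by the concept identifiers of its subwords.

    Tokens that cannot be segmented are kept as lowercase full words.
    """
    concepts = []
    for token in text.split():
        parts = segment(token, lexicon)
        if parts is None:
            concepts.append(token.lower())
        else:
            concepts.extend(lexicon[p] for p in parts if lexicon[p] is not None)
    return concepts


if __name__ == "__main__":
    print(segment("gastroenteritis", TOY_LEXICON))
    print(to_concepts("gastroenteritis", TOY_LEXICON))
    print(to_concepts("Magendarmentzuendung", TOY_LEXICON))
```

In this toy setting both terms reduce to the same three identifiers, so a query in one language would match a document in the other. Because the transformation happens before indexing and querying, its output could be handed to any text search engine as ordinary tokens, which is one way to read the abstract's point about independence from the underlying retrieval system.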