The influence of basic tokenization on biomedical document retrieval

  • Authors:
  • Dolf Trieschnigg; Wessel Kraaij; Franciska de Jong

  • Affiliations:
  • University of Twente, Enschede, Netherlands; TNO ICT, Delft, Netherlands; University of Twente, Enschede, Netherlands

  • Venue:
  • SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
  • Year:
  • 2007

Abstract

Tokenization is a fundamental preprocessing step in information retrieval (IR) systems, in which text is turned into index terms. This paper quantifies and compares the influence of various simple tokenization techniques on document retrieval effectiveness in two domains: biomedicine and news. As expected, biomedical retrieval is more sensitive to small changes in the tokenization method. The tokenization strategy can make the difference between a mediocre and a well-performing IR system, especially in the biomedical domain.
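
To illustrate why small tokenization changes can matter for biomedical text, the following is a minimal Python sketch. It is not taken from the paper: the three tokenizer functions, their names, and the sample query string are hypothetical and chosen only for illustration. They show how the same text yields different index terms depending on how hyphens, parentheses, and letter/digit boundaries are handled.

```python
import re


def tokenize_whitespace(text):
    """Lowercase and split on whitespace only; punctuation stays attached."""
    return text.lower().split()


def tokenize_alnum(text):
    """Lowercase and keep maximal runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tokenize_split_letter_digit(text):
    """Like tokenize_alnum, but also break tokens at letter/digit boundaries."""
    return re.findall(r"[a-z]+|[0-9]+", text.lower())


# Hypothetical biomedical-style query (an assumption, not from the paper).
query = "IL-2 receptor (CD25) in MIP-1alpha signalling"

for tokenizer in (tokenize_whitespace, tokenize_alnum, tokenize_split_letter_digit):
    print(f"{tokenizer.__name__}: {tokenizer(query)}")
```

In this sketch, a term such as "IL-2" becomes one index term under the first strategy and two under the others, so a query and a document match only if both were tokenized the same way and the resulting fragments are still discriminative.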