An empirical study of tokenization strategies for biomedical information retrieval

Authors:
Jing Jiang; Chengxiang Zhai
Affiliations:
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA 61801; Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, USA 61801
Venue:
Information Retrieval
Year:
2007

Abstract

Because biological names vary widely in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, tokenization strategies for biomedical text have rarely been evaluated. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on retrieval performance. As expected, our experimental results show that tokenization can significantly affect retrieval accuracy; appropriate tokenization can improve performance by up to 96%, as measured by mean average precision (MAP). In particular, we show that different query types require different tokenization heuristics, that stemming is effective only for certain queries, and that stop word removal in general does not improve retrieval performance on biomedical text.
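
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of how mean average precision and a relative improvement over a baseline are commonly computed. It follows the standard textbook definition of MAP and is not taken from the study itself; the function names and the data layout (a ranked list of document IDs per query, a set of relevant IDs per query) are assumptions made for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision for one query (standard definition).

    Sums precision at each rank where a relevant document appears,
    then divides by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all judged queries.

    `runs` maps a query ID to its ranked document IDs; `qrels` maps a
    query ID to its relevant document IDs (both layouts are hypothetical).
    """
    if not qrels:
        return 0.0
    total = sum(
        average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in qrels.items()
    )
    return total / len(qrels)


def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage change of `improved` relative to a nonzero `baseline`."""
    return (improved - baseline) / baseline * 100.0
```

Under this definition, a figure such as "up to 96%" would correspond to `relative_improvement` applied to the MAP of a baseline tokenization and the MAP of a better one, so it describes a relative change in MAP, not an absolute gain of 96 points.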