Because biological names vary greatly in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, tokenization strategies for biomedical text have rarely been evaluated. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on retrieval performance. As expected, our experimental results show that tokenization can significantly affect retrieval accuracy; appropriate tokenization can improve performance by up to 96%, measured by mean average precision (MAP). In particular, we show that different query types require different tokenization heuristics, that stemming is effective only for certain queries, and that stop word removal in general does not improve retrieval performance on biomedical text.
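For readers unfamiliar with the evaluation measure, the following is a minimal sketch of how mean average precision (MAP) is conventionally computed from ranked result lists and relevance judgments. It is not the evaluation code used in this work. The function names, the document and query identifiers, and the `relative_improvement` helper are illustrative assumptions, and reading the reported gain as a relative change in MAP over a baseline is likewise an assumption that the abstract does not spell out.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against a set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document is retrieved.
            precision_sum += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    total = sum(
        average_precision(runs.get(query_id, []), relevant)
        for query_id, relevant in qrels.items()
    )
    return total / len(qrels)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative change of a score over a non-zero baseline (hypothetical helper)."""
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    qrels = {"q1": {"d1", "d3"}, "q2": {"d4"}}
    runs = {"q1": ["d1", "d2", "d3"], "q2": ["d5", "d4"]}
    print(f"MAP = {mean_average_precision(runs, qrels):.4f}")
```

Under this reading, comparing two tokenization strategies amounts to producing one set of ranked lists per strategy, computing MAP for each, and passing the two values to `relative_improvement`.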