Detecting invalid dictionary entries for biomedical text mining

Authors:
Hironori Takeuchi;Issei Yoshida;Yohei Ikawa;Kazuo Iida;Yoko Fukui
Affiliations:
IBM Research, Tokyo Research Laboratory, IBM Japan, Ltd., Shimotsuruma, Yamato-shi Kanagawa, Japan;IBM Research, Tokyo Research Laboratory, IBM Japan, Ltd., Shimotsuruma, Yamato-shi Kanagawa, Japan;IBM Research, Tokyo Research Laboratory, IBM Japan, Ltd., Shimotsuruma, Yamato-shi Kanagawa, Japan;Research Institute of Bio-system Informatics, Tohoku Chemical Co., Ltd., Japan;Research Institute of Bio-system Informatics, Tohoku Chemical Co., Ltd., Japan
Venue:
KDLL'06 Proceedings of the 2006 international conference on Knowledge Discovery in Life Science Literature
Year:
2006


Abstract

In text mining, calculating precise keyword frequency distributions for a particular document collection requires mapping the different keywords that denote the same entity to a single canonical form. In the life science domain, we can construct a large dictionary of canonical forms and their variants from external resources and use it for term aggregation. However, such an automatically generated dictionary contains many invalid entries, which distort the calculated keyword frequencies. In this paper, we propose and test methods to detect invalid entries in the dictionary.
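
The term aggregation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system or their detection method. The dictionary contents, the sample sentences, and the names `VARIANT_TO_CANONICAL`, `tokenize`, and `aggregate_term_counts` are all hypothetical, and the greedy longest-match lookup is only one assumed way to apply such a dictionary. The entry "was" stands in for the kind of invalid entry the abstract mentions: a variant that collides with an ordinary English word and therefore inflates the count of its canonical form.

```python
import re
from collections import Counter

# Hypothetical dictionary: normalized variant -> canonical form.
VARIANT_TO_CANONICAL = {
    "tp53": "TP53",
    "p53": "TP53",
    "tumor protein p53": "TP53",
    # Assumed invalid entry: a symbol that collides with a common English word.
    "was": "WAS",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def aggregate_term_counts(documents, variant_to_canonical):
    """Count canonical forms using greedy longest-match dictionary lookup."""
    max_len = max(len(variant.split()) for variant in variant_to_canonical)
    counts = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        i = 0
        while i < len(tokens):
            # Try the longest candidate phrase first, then shorter ones.
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                candidate = " ".join(tokens[i:i + n])
                canonical = variant_to_canonical.get(candidate)
                if canonical is not None:
                    counts[canonical] += 1
                    i += n
                    break
            else:
                # No variant starts at this token; move on.
                i += 1
    return counts


# Made-up sample sentences, for illustration only.
documents = [
    "p53 was mutated in most samples.",
    "Tumor protein p53 regulates the cell cycle.",
    "TP53 status was recorded.",
]

counts_with_invalid = aggregate_term_counts(documents, VARIANT_TO_CANONICAL)

# Drop the assumed invalid entry and recount.
cleaned = {v: c for v, c in VARIANT_TO_CANONICAL.items() if v != "was"}
counts_cleaned = aggregate_term_counts(documents, cleaned)

print(counts_with_invalid)
print(counts_cleaned)
```

In this toy collection the three variants of TP53 are merged into one count, while the entry "was" produces matches for the canonical form WAS even though no sentence mentions that entity. Removing the entry eliminates the spurious count without affecting TP53.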