Assessment of approximate string matching in a biomedical text retrieval problem

Authors:
J. F. Wang;Z. R. Li;C. Z. Cai;Y. Z. Chen
Affiliations:
Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore;Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore and Department of Chemistry, Sichuan University, Chengdu 61 ...;Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore and Department of Applied Physics, Chongqing University, Ch ...;Department of Computational Science, National University of Singapore, Blk SOC1, Level 7, 3 Science Drive 2, Singapore 117543, Singapore
Venue:
Computers in Biology and Medicine
Year:
2005

Abstract

Text-based search is widely used for biomedical data mining and knowledge discovery. Character errors in the literature reduce the accuracy of data mining, and methods for addressing this problem are being explored. This work tests the usefulness of the Smith-Waterman algorithm with an affine gap penalty as a method for biomedical literature retrieval. Names of medicinal herbs collected from the herbal medicine literature are matched against those from the medicinal chemistry literature using this algorithm at different string identity levels (80-100%). The optimum performance is at a string identity of 88%, at which the recall and precision are 96.9% and 97.3%, respectively. Our study suggests that the Smith-Waterman algorithm is useful for improving the success rate of biomedical text retrieval.
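
To make the matching procedure concrete, the following is a minimal Python sketch of Smith-Waterman local alignment with an affine gap penalty (the Gotoh recurrences), followed by a string identity check. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring parameters or the exact definition of string identity, so the scores (match=2, mismatch=-1, gap_open=3, gap_extend=1), the identity definition (identical aligned characters divided by the length of the longer name), and the function names are all hypothetical. Only the 88% threshold comes from the abstract.

```python
def smith_waterman_affine(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Return (best local score, identical characters in that alignment).

    Scoring values are illustrative assumptions, not taken from the paper.
    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg = (float("-inf"), 0)
    rows, cols = len(a) + 1, len(b) + 1
    # Each cell holds (score, matches); tuples compare by score first.
    H = [[(0, 0)] * cols for _ in range(rows)]  # best alignment ending at (i, j)
    E = [[neg] * cols for _ in range(rows)]     # ends with a gap in a
    F = [[neg] * cols for _ in range(rows)]     # ends with a gap in b
    best = (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            # Open a new gap from H, or extend an existing one.
            E[i][j] = max((H[i][j - 1][0] - gap_open, H[i][j - 1][1]),
                          (E[i][j - 1][0] - gap_extend, E[i][j - 1][1]))
            F[i][j] = max((H[i - 1][j][0] - gap_open, H[i - 1][j][1]),
                          (F[i - 1][j][0] - gap_extend, F[i - 1][j][1]))
            same = a[i - 1] == b[j - 1]
            diag = (H[i - 1][j - 1][0] + (match if same else mismatch),
                    H[i - 1][j - 1][1] + int(same))
            cell = max(diag, E[i][j], F[i][j])
            # Local alignment: restart whenever the score is not positive.
            H[i][j] = cell if cell[0] > 0 else (0, 0)
            best = max(best, H[i][j])
    return best


def string_identity(a, b):
    """Assumed definition: identical aligned characters / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return smith_waterman_affine(a, b)[1] / longest


def is_match(a, b, threshold=0.88):
    """Accept a pair of names when identity reaches the threshold."""
    return string_identity(a, b) >= threshold
```

For example, `is_match("ABCDEFGH", "ABCDXEFGH")` accepts the pair, because the single inserted character is absorbed by one gap and the identity under the assumed definition is 8/9. Recall and precision at a given threshold would then be computed by comparing the accepted pairs against a manually verified set of true name matches.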