A comparison study of biomedical short form definition detection algorithms

  • Authors:
  • Manabu Torii, Hongfang Liu, Zhangzhi Hu, Cathy Wu

  • Affiliations:
  • Georgetown University Medical Center, Washington, DC (all authors)

  • Venue:
  • TMBIO '06: Proceedings of the 1st International Workshop on Text Mining in Bioinformatics
  • Year:
  • 2006

Abstract

As more research is dedicated to literature mining in the biomedical domain, more systems become available for building literature mining applications. In this study, we focus on one specific task: detecting definitions of acronyms, abbreviations, and symbols in biomedical text. The study was designed to answer the following questions: (i) how well a system detects definitions when provided with a large set of recently published biomedical documents, (ii) how well existing knowledge bases cover acronyms/abbreviations/symbols as synonyms of their definitions, and (iii) how results from different systems can be combined. We evaluated three publicly available systems for detecting definitions of acronyms/abbreviations/symbols, namely ALICE (a handcrafted pattern/rule-based system), the system by Chang et al. (a machine-learning system), and the algorithm by Schwartz and Hearst (a simple alignment-based program), as well as the conceptual coverage of existing thesauri, namely the UMLS (Unified Medical Language System) and BioThesaurus (a thesaurus of names for all UniProt protein records). We found that the three systems agreed on a large portion of the results (over 94% of all detected definitions), mainly because most acronyms/abbreviations/symbols are formed by taking initial characters from their definitions. The precision and recall of the three systems are comparable. However, manual inspection of the results showed that most systems have difficulty detecting definitions for chemical/gene/protein symbols; ALICE performed relatively better on such symbols than the other two, possibly because it was fine-tuned for them. We also found that existing knowledge bases have good coverage of definitions for frequently defined acronyms/abbreviations/symbols. Potential combinations of the three systems were also discussed and implemented.
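
Of the three systems compared, the Schwartz and Hearst method is the easiest to characterize: it aligns the characters of a short form, right to left, against the text that precedes it (the candidate long form). The sketch below is a minimal Python rendering of that alignment heuristic, written for illustration; it is not the authors' evaluation code, and the function name and surrounding details are assumptions.

```python
def find_best_long_form(short_form: str, long_form: str):
    """Schwartz & Hearst-style right-to-left alignment (illustrative sketch).

    Returns the shortest suffix of `long_form` that accounts for every
    alphanumeric character of `short_form`, or None if no alignment exists.
    """
    s_index = len(short_form) - 1   # position in the short form
    l_index = len(long_form) - 1    # position in the candidate long form

    while s_index >= 0:
        curr_char = short_form[s_index].lower()
        # Skip non-alphanumeric characters in the short form (e.g. hyphens).
        if not curr_char.isalnum():
            s_index -= 1
            continue
        # Scan the long form leftward until the current character is matched;
        # the first character of the short form must also begin a word.
        while l_index >= 0 and (
            long_form[l_index].lower() != curr_char
            or (s_index == 0 and l_index > 0 and long_form[l_index - 1].isalnum())
        ):
            l_index -= 1
        if l_index < 0:
            return None             # no valid alignment found
        l_index -= 1
        s_index -= 1

    # Expand to the start of the word containing the leftmost matched character.
    l_index = long_form.rfind(" ", 0, l_index + 1) + 1
    return long_form[l_index:]


# Example: most short forms in the study follow this initial-character pattern.
print(find_best_long_form("HMM", "a hidden Markov model"))  # -> 'hidden Markov model'
```

Because most short forms are built from initial characters of their definitions in this way, such an alignment succeeds for the bulk of cases, which is consistent with the high agreement observed among the three systems.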