Alignment-HMM-based extraction of abbreviations from biomedical text

Authors: Dana Movshovitz-Attias; William W. Cohen
Affiliations: Carnegie Mellon University, Pittsburgh, PA; Carnegie Mellon University, Pittsburgh, PA
Venue: BioNLP '12: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing
Year: 2012

Abstract

We present an algorithm for extracting abbreviation definitions from biomedical text. Our approach is based on an alignment HMM that matches abbreviations to their definitions. We report 98% precision and 93% recall on a standard data set, and 95% precision and 91% recall on an additional test set. These results improve on previously reported methods, and the model has several advantages: it (1) is simpler and faster than a comparable alignment-based abbreviation extractor; (2) generalizes naturally to specific types of abbreviations, e.g., abbreviations of chemical formulas; (3) is trained on a set of unlabeled examples; and (4) associates a probability with each predicted definition. Using the abbreviation alignment model, we extracted over 1.4 million abbreviations from a corpus of 200K full-text PubMed papers, including 455,844 unique definitions.
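
To make the alignment idea concrete, here is a minimal sketch of a Viterbi-style alignment between an abbreviation and a candidate definition. It is not the model described in the paper: the function name `viterbi_align`, the three operations, and the hand-set probabilities in `LOG_P` are all assumptions chosen for illustration, whereas the abstract states only that the actual model is an alignment HMM trained on unlabeled examples. The sketch consumes the definition left to right, either skipping a definition character or pairing it with the next abbreviation character, and returns the highest-scoring alignment with its log-probability.

```python
import math

# Illustrative, hand-set log-probabilities (assumption: a trained model
# would estimate these from data rather than fix them by hand).
LOG_P = {
    "match_word_start": math.log(0.6),  # abbreviation char = first letter of a word
    "match_inner": math.log(0.1),       # abbreviation char = letter inside a word
    "skip": math.log(0.3),              # definition char absent from the abbreviation
}

NEG_INF = float("-inf")


def viterbi_align(short, long):
    """Best monotone alignment of `short` against `long`.

    Returns (log_prob, pairs), where pairs is a list of
    (short_index, long_index). If some abbreviation character cannot
    be matched in order, returns (-inf, []).
    """
    s, l = short.lower(), long.lower()
    n, m = len(s), len(l)
    # best[i][j]: best log-probability of consuming s[:i] and l[:j].
    best = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(1, m + 1):
            # Option 1: skip definition character l[j-1].
            if best[i][j - 1] > NEG_INF:
                best[i][j] = best[i][j - 1] + LOG_P["skip"]
                back[i][j] = "skip"
            # Option 2: pair s[i-1] with an identical definition character.
            if i > 0 and s[i - 1] == l[j - 1] and best[i - 1][j - 1] > NEG_INF:
                at_word_start = j == 1 or not l[j - 2].isalnum()
                key = "match_word_start" if at_word_start else "match_inner"
                cand = best[i - 1][j - 1] + LOG_P[key]
                if cand > best[i][j]:
                    best[i][j] = cand
                    back[i][j] = "match"
    if best[n][m] == NEG_INF:
        return NEG_INF, []
    # Trace back from the final cell to recover the matched positions.
    pairs = []
    i, j = n, m
    while j > 0:
        if back[i][j] == "match":
            pairs.append((i - 1, j - 1))
            i -= 1
        j -= 1
    pairs.reverse()
    return best[n][m], pairs
```

Under these assumed scores, `viterbi_align("HMM", "hidden Markov model")` pairs the three abbreviation letters with the first letters of the three words, at definition positions 0, 7, and 14. Because the score is a log-probability, candidate definitions for the same abbreviation can be ranked by it, which loosely mirrors the abstract's point that each predicted definition carries a probability.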