Identifying well-formed biomedical phrases in MEDLINE® text

Authors:
Won Kim;Lana Yeganova;Donald C. Comeau;W. John Wilbur
Affiliations:
National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA;National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA;National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA;National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Venue:
Journal of Biomedical Informatics
Year:
2012

Abstract

People frequently interact with retrieval systems to satisfy their information needs. Well-formed phrases that humans can readily understand are a crucial interface between humans and the web, and the ability to index and search with such phrases benefits human-web interaction. In this paper we consider the problem of identifying humanly understandable, well-formed, high-quality biomedical phrases in MEDLINE documents. Previous approaches to detecting such phrases have been syntactic, statistical, or a hybrid of the two. We propose a supervised learning approach for identifying high-quality phrases. First, we obtain a set of known well-formed, useful phrases from an existing source and label them as positive. We then extract from MEDLINE a large set of multiword strings that contain no stop words or punctuation. We believe this unlabeled set contains many well-formed phrases, and our goal is to identify these additional high-quality phrases. We examine various feature combinations and several machine learning strategies designed to solve this problem. With a proper choice of machine learning methods and features, we can identify the strings in this large collection that are likely to be high-quality phrases. We evaluate our approach by making human judgments on multiword strings extracted from MEDLINE using our methods, and find that over 85% of these extracted phrase candidates are judged to be of high quality.
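
The candidate-extraction and labeling steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the tiny stop-word list, the `max_len` cap, and the function names `candidate_strings` and `label_candidates` are all assumptions made for illustration, and the feature extraction and classifier training that the paper studies are not shown.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "by", "for",
              "in", "is", "of", "on", "or", "the", "to", "with"}

# Words (optionally hyphenated) or single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|[^\sA-Za-z0-9]")


def candidate_strings(text, max_len=3):
    """Yield strings of 2..max_len tokens that contain no stop word or punctuation."""
    run = []  # current run of consecutive content tokens
    # A trailing "." acts as a sentinel so the final run is flushed.
    for tok in TOKEN_RE.findall(text.lower()) + ["."]:
        if tok in STOP_WORDS or not tok[0].isalnum():
            # A stop word or punctuation mark ends the run: emit its n-grams.
            for n in range(2, max_len + 1):
                for i in range(len(run) - n + 1):
                    yield " ".join(run[i:i + n])
            run = []
        else:
            run.append(tok)


def label_candidates(texts, known_phrases, max_len=3):
    """Split counted candidates into positives (known phrases) and unlabeled strings."""
    counts = Counter(c for t in texts for c in candidate_strings(t, max_len))
    positives = {c: n for c, n in counts.items() if c in known_phrases}
    unlabeled = {c: n for c, n in counts.items() if c not in known_phrases}
    return positives, unlabeled


if __name__ == "__main__":
    sentence = "Expression of tumor necrosis factor in the liver, and bone marrow cells."
    pos, unl = label_candidates([sentence], known_phrases={"bone marrow"})
    print(pos)
    print(sorted(unl))
```

In a setup like this, the positive set and the unlabeled set would then feed a learner that ranks unlabeled strings by how phrase-like they are. Which features and learning strategy work best is the question the paper itself addresses.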