Dynamic categorization of clinical research eligibility criteria by hierarchical clustering

Authors:
Zhihui Luo;Meliha Yetisgen-Yildiz;Chunhua Weng
Affiliations:
Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States;Biomedical & Health Informatics, University of Washington, Seattle, WA 98195, United States;Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
Venue:
Journal of Biomedical Informatics
Year:
2011

Abstract

Objective: To semi-automatically induce semantic categories of eligibility criteria from text and to automatically classify eligibility criteria based on their semantic similarity.

Design: The UMLS semantic types and a set of previously developed semantic preference rules were utilized to create an unambiguous semantic feature representation to induce eligibility criteria categories through hierarchical clustering and to train supervised classifiers.

Measurements: We induced 27 categories and measured the prevalence of the categories in 27,278 eligibility criteria from 1578 clinical trials and compared the classification performance (i.e., precision, recall, and F1-score) between the UMLS-based feature representation and the "bag of words" feature representation among five common classifiers in Weka, including J48, Bayesian Network, Naive Bayesian, Nearest Neighbor, and instance-based learning classifier.

Results: The UMLS semantic feature representation outperforms the "bag of words" feature representation in 89% of the criteria categories. Using the semantically induced categories, machine-learning classifiers required only 2000 instances to stabilize classification performance. The J48 classifier yielded the best F1-score and the Bayesian Network classifier achieved the best learning efficiency.

Conclusion: The UMLS is an effective knowledge source and can enable an efficient feature representation for semi-automated semantic category induction and automatic categorization for clinical research eligibility criteria and possibly other clinical text.
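
As an illustration of the category-induction step, the following is a minimal sketch of agglomerative (hierarchical) clustering over semantic-type features. It is not the authors' implementation: the abstract does not state the distance measure, linkage method, or stopping rule, so Jaccard distance, average linkage, and the 0.6 threshold are assumptions chosen for simplicity. The five criteria identifiers and their semantic-type assignments are hypothetical toy data.

```python
from itertools import combinations

# Hypothetical toy data: each criterion is represented by the set of
# UMLS semantic types assumed to have been assigned to it.
criteria = {
    "c1": {"Disease or Syndrome"},
    "c2": {"Disease or Syndrome", "Temporal Concept"},
    "c3": {"Pharmacologic Substance"},
    "c4": {"Pharmacologic Substance", "Temporal Concept"},
    "c5": {"Laboratory Procedure", "Quantitative Concept"},
}


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| (assumed distance, not from the paper)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(cluster_a, cluster_b, features):
    """Mean pairwise distance between the members of two clusters."""
    pairs = [(x, y) for x in cluster_a for y in cluster_b]
    total = sum(jaccard_distance(features[x], features[y]) for x, y in pairs)
    return total / len(pairs)


def agglomerative_clusters(features, threshold):
    """Merge the two closest clusters until the closest pair exceeds threshold."""
    clusters = [frozenset([key]) for key in sorted(features)]
    while len(clusters) > 1:
        (a, b), dist = min(
            (((a, b), average_linkage(a, b, features))
             for a, b in combinations(clusters, 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return sorted(sorted(c) for c in clusters)


# The threshold is an illustrative assumption; in practice it would be tuned
# or replaced by manual review of the resulting dendrogram.
clusters = agglomerative_clusters(criteria, threshold=0.6)
print(clusters)
```

In this toy run the disease-related and drug-related criteria each merge into one cluster, while the laboratory criterion stays on its own. In a semi-automated workflow such as the one the abstract describes, clusters like these would then be reviewed and named by a person before being used as classification labels.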