A bootstrapping algorithm to improve cohort identification using structured data

Authors:
Sasikiran Kandula;Qing Zeng-Treitler;Lingji Chen;William L. Salomon;Bruce E. Bray
Affiliations:
Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States;Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States;Scientific Systems Company Inc., Woburn, MA, United States;Clinical Metrics LLC, Poland, ME, United States;Department of Biomedical Informatics, University of Utah, Salt Lake City, UT, United States
Venue:
Journal of Biomedical Informatics
Year:
2011

Abstract

Cohort identification is an important step in conducting clinical research studies. Using ICD-9 codes to identify disease cohorts is a common approach that can yield satisfactory results in certain conditions; however, many use cases require more accurate methods. In this study, we propose a bootstrapping method that supplements ICD-9 codes with lab results, medications, and other structured data to build classification models that identify cohorts more accurately. The proposed method does not require prior information about the true class of the patients. We used the method to identify Diabetes Mellitus (DM) and Hyperlipidemia (HL) patient cohorts from a database of 800,000 patients. Evaluation results show that the method identified 11,000 patients who did not have DM-related ICD-9 codes as positive for DM and 52,000 patients without HL codes as positive for HL. A review of 400 patient charts (200 patients for each condition) by two clinicians shows that, for both conditions studied, the labeling assigned by the proposed approach is more consistent with that of the clinicians than labeling through ICD-9 codes. The method is reasonably automated and, we believe, holds potential for inexpensive, more accurate cohort identification.
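
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea: start from noisy ICD-9-derived labels, fit a classifier on other structured features, relabel, and repeat until the labels stop changing. Everything specific here is an assumption made for illustration and is not taken from the paper: the nearest-centroid classifier, the two toy features (an HbA1c value and a diabetes-medication flag), the function name `bootstrap_labels`, and the sample data.

```python
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bootstrap_labels(features, seed_labels, max_rounds=10):
    """Iteratively refine noisy seed labels (hypothetical helper).

    features    -- one numeric vector per patient (e.g. labs, medication flags)
    seed_labels -- initial booleans, e.g. "has a disease-related ICD-9 code"
    A nearest-centroid classifier stands in for whatever model is actually used.
    """
    labels = list(seed_labels)
    for _ in range(max_rounds):
        pos = [f for f, y in zip(features, labels) if y]
        neg = [f for f, y in zip(features, labels) if not y]
        if not pos or not neg:
            break  # cannot fit two classes; keep the current labels
        c_pos, c_neg = centroid(pos), centroid(neg)
        # Relabel every patient by the closer class centroid.
        new_labels = [sq_dist(f, c_pos) < sq_dist(f, c_neg) for f in features]
        if new_labels == labels:
            break  # labels are stable; stop iterating
        labels = new_labels
    return labels


# Toy data (invented): [HbA1c value, on diabetes medication (1/0)].
features = [
    [8.5, 1],
    [9.0, 1],
    [8.8, 1],  # no DM code on record, but labs and medication suggest DM
    [5.2, 0],
    [5.0, 0],
    [5.4, 0],
]
# Seed labels from ICD-9 codes alone; the third patient is missed.
icd9_seed = [True, True, False, False, False, False]

refined = bootstrap_labels(features, icd9_seed)
print(refined)
```

In this toy run the third patient, who has no DM code, ends up labeled positive because their lab value and medication flag sit close to the coded patients. A realistic version would need feature scaling, a stronger classifier, and validation against chart review, as the abstract describes.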