Semi-supervised clinical text classification with Laplacian SVMs: An application to cancer case management

Authors:
Vijay Garla;Caroline Taylor;Cynthia Brandt
Affiliations:
Yale Center for Medical Informatics, Yale University, 300 George Street, Suite 501, New Haven, CT 06520-8009, United States;Connecticut VA Healthcare System, 950 Campbell Avenue, West Haven, CT 06516, United States;Yale Center for Medical Informatics, Yale University, 300 George Street, Suite 501, New Haven, CT 06520-8009, United States and Connecticut VA Healthcare System, 950 Campbell Avenue, West Haven, C ...
Venue:
Journal of Biomedical Informatics
Year:
2013

Abstract

Objective: To compare linear and Laplacian SVMs on a clinical text classification task, and to evaluate the effect of unlabeled training data on Laplacian SVM performance.

Background: Developing machine-learning-based clinical text classifiers requires labeled training data, which is obtained through manual review by clinicians. Because labeling is laborious and expensive, training data sets in the clinical domain are of limited size. In contrast, electronic medical record (EMR) systems contain hundreds of thousands of unlabeled notes that supervised machine learning approaches do not use. Semi-supervised learning algorithms train classifiers on both labeled and unlabeled data, and can outperform their supervised counterparts.

Methods: We trained support vector machines (SVMs) and Laplacian SVMs on a training reference standard of 820 abdominal CT, MRI, and ultrasound reports labeled for the presence of potentially malignant liver lesions that require follow-up (positive class prevalence 77%). The Laplacian SVM used 19,845 randomly sampled unlabeled notes in addition to the training reference standard. We evaluated SVMs and Laplacian SVMs on a test set of 520 labeled reports.

Results: The Laplacian SVM trained on labeled and unlabeled radiology reports significantly outperformed supervised SVMs (macro-F1 0.773 vs. 0.741, sensitivity 0.943 vs. 0.911, positive predictive value 0.877 vs. 0.883). Performance improved with the number of labeled and unlabeled notes used to train the Laplacian SVM (Pearson's r = 0.529 for the correlation between the number of unlabeled notes and macro-F1 score). These results suggest that practical semi-supervised methods such as the Laplacian SVM can leverage the large, unlabeled corpora that reside within EMRs to improve clinical text classification.
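The reported metrics can be illustrated with a minimal sketch that computes sensitivity, positive predictive value, and macro-F1 for a binary task from a confusion matrix. The function names and the counts in the usage example are hypothetical and chosen only for illustration; they are not taken from the study, and the sketch assumes macro-F1 is the unweighted mean of the per-class F1 scores.

```python
def binary_metrics(tp, fp, fn):
    """Return (sensitivity, PPV, F1) for the class treated as positive."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall
    ppv = tp / (tp + fp) if (tp + fp) else 0.0          # precision
    denom = sensitivity + ppv
    f1 = 2 * sensitivity * ppv / denom if denom else 0.0
    return sensitivity, ppv, f1


def macro_f1(tp, fp, fn, tn):
    """Unweighted mean of the F1 scores of the positive and negative classes."""
    _, _, f1_pos = binary_metrics(tp, fp, fn)
    # Swap roles so the negative class is treated as "positive":
    # its true positives are tn, its false positives are fn,
    # and its false negatives are fp.
    _, _, f1_neg = binary_metrics(tn, fn, fp)
    return (f1_pos + f1_neg) / 2


# Hypothetical counts, for illustration only (not from the paper).
sens, ppv, f1_pos = binary_metrics(tp=90, fp=10, fn=10)
score = macro_f1(tp=90, fp=10, fn=10, tn=40)
```

Because macro-F1 weights both classes equally, it can sit well below the positive-class F1 when the minority class is classified less accurately, which is worth keeping in mind on a data set with an imbalanced positive class prevalence.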