Combining Bayesian Text Classification and Shrinkage to Automate Healthcare Coding: A Data Quality Analysis

  • Authors: Eitel J. M. Lauría; Alan D. March
  • Affiliations: Marist College, Poughkeepsie, NY; Hospital Universitario Austral, Buenos Aires, Argentina
  • Venue: Journal of Data and Information Quality (JDIQ)
  • Year: 2011

Abstract

This article analyzes the data quality issues that emerge when training a shrinkage-based classifier with noisy data. A statistical text analysis technique based on a shrinkage-based variation of multinomial naive Bayes is applied to a set of free-text discharge diagnoses drawn from a number of hospitalizations. All of these diagnoses had previously been coded according to the Spanish edition of ICD-9-CM. We analyze the predictive power and robustness of the proposed statistical machine learning algorithm for ICD-9-CM classification, and we explore the effect of training the models on both clean and noisy data. In particular, our work investigates the extent to which errors in free-text diagnoses propagate to the classification model. A measure of predictive accuracy is first calculated for the text classification algorithm under analysis. Subsequently, the quality of the sample data is incrementally deteriorated by simulating errors in the text and/or codes, and the predictive accuracy is recomputed for each of the noisy samples for comparison. Our research shows that the shrinkage-based classifier is a valid alternative for automating ICD-9-CM coding, even when the quality of the training data is in question.
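The evaluation loop described in the abstract (train, measure accuracy, degrade the training data, retrain, compare) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the class name `ShrinkageNB`, the fixed mixture weights (shrinkage weights are often learned from held-out data instead), the rule that a code's parent category is the part before the dot, the two noise models, and the tiny invented English training set.

```python
import math
import random
import string
from collections import Counter, defaultdict

# Invented toy data: (free-text diagnosis, code). Not from the article.
TRAIN = [
    ("diabetes mellitus type 2", "250.0"), ("diabetes mellitus uncomplicated", "250.0"),
    ("diabetic ketoacidosis", "250.1"), ("diabetes with ketoacidosis", "250.1"),
    ("essential hypertension", "401.9"), ("arterial hypertension", "401.9"),
    ("pneumonia", "486"), ("community acquired pneumonia", "486"),
]
TEST = [("diabetes mellitus", "250.0"), ("hypertension", "401.9"), ("pneumonia", "486")]

def parent_of(code):
    # Assumption: the parent category is the part of the code before the dot.
    return code.split(".")[0]

class ShrinkageNB:
    """Multinomial naive Bayes with word estimates shrunk toward the parent category."""

    def __init__(self, weights=(0.6, 0.3, 0.1)):
        # Hypothetical fixed mixture weights: (leaf code, parent category, uniform).
        self.weights = weights

    def fit(self, texts, codes):
        self.leaf, self.parent = defaultdict(Counter), defaultdict(Counter)
        self.prior, self.n = Counter(codes), len(codes)
        vocab = set()
        for text, code in zip(texts, codes):
            tokens = text.lower().split()
            self.leaf[code].update(tokens)
            self.parent[parent_of(code)].update(tokens)
            vocab.update(tokens)
        self.uniform = 1.0 / (len(vocab) + 1)  # +1 reserves mass for unseen words
        return self

    @staticmethod
    def _mle(counts, word):
        total = sum(counts.values())
        return counts[word] / total if total else 0.0

    def _word_prob(self, word, code):
        w_leaf, w_parent, w_uniform = self.weights
        return (w_leaf * self._mle(self.leaf[code], word)
                + w_parent * self._mle(self.parent[parent_of(code)], word)
                + w_uniform * self.uniform)

    def predict(self, text):
        tokens = text.lower().split()
        def log_score(code):
            return math.log(self.prior[code] / self.n) + sum(
                math.log(self._word_prob(w, code)) for w in tokens)
        return max(self.prior, key=log_score)

def corrupt_labels(codes, rate, rng):
    # With probability `rate`, swap a code for a different one seen in the data.
    labels = sorted(set(codes))
    out = []
    for code in codes:
        others = [c for c in labels if c != code]
        out.append(rng.choice(others) if others and rng.random() < rate else code)
    return out

def corrupt_text(text, rate, rng):
    # With probability `rate`, overwrite one character of a token with a random
    # lowercase letter (which may happen to be the same letter).
    out = []
    for tok in text.split():
        if rng.random() < rate:
            i = rng.randrange(len(tok))
            tok = tok[:i] + rng.choice(string.ascii_lowercase) + tok[i + 1:]
        out.append(tok)
    return " ".join(out)

def run_experiment(noise_rates, seed=0):
    # Train on progressively noisier copies of TRAIN; always score on clean TEST.
    results = {}
    for rate in noise_rates:
        rng = random.Random(seed)
        texts = [corrupt_text(t, rate, rng) for t, _ in TRAIN]
        codes = corrupt_labels([c for _, c in TRAIN], rate, rng)
        model = ShrinkageNB().fit(texts, codes)
        results[rate] = sum(model.predict(t) == c for t, c in TEST) / len(TEST)
    return results

for rate, acc in run_experiment([0.0, 0.2, 0.4]).items():
    print(f"noise rate {rate:.1f}: accuracy {acc:.2f}")
```

Scoring on an uncorrupted test set while degrading only the training set isolates how much of the injected noise propagates into the model, which is the question the abstract poses. With a data set this small the printed accuracies only demonstrate the mechanics and say nothing about the article's results.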