Combining Bayesian Text Classification and Shrinkage to Automate Healthcare Coding: A Data Quality Analysis

Authors:
Eitel J. M. Lauría;Alan D. March
Affiliations:
Marist College, Poughkeepsie, NY;Hospital Universitario Austral, Buenos Aires, Argentina
Venue:
Journal of Data and Information Quality (JDIQ)
Year:
2011

Abstract

This article analyzes the data quality issues that emerge when a shrinkage-based classifier is trained with noisy data. A statistical text analysis technique, a shrinkage-based variation of multinomial naive Bayes, is applied to a set of free-text discharge diagnoses drawn from a number of hospitalizations. All of these diagnoses were previously coded according to the Spanish edition of ICD-9-CM. We assess the predictive power and robustness of the proposed statistical machine learning algorithm for ICD-9-CM classification, and explore the effect of training the models with both clean and noisy data. In particular, we investigate the extent to which errors in free-text diagnoses propagate to the classification model. A measure of predictive accuracy is first calculated for the text classification algorithm under analysis. The quality of the sample data is then incrementally deteriorated by simulating errors in the text, the codes, or both, and the predictive accuracy is recomputed for each noisy sample for comparison. Our research shows that the shrinkage-based classifier is a valid alternative for automating ICD-9-CM coding, even when the quality of the training data is in question.
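
The experimental loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not say how the shrinkage weights are estimated, so this sketch assumes fixed interpolation weights between a code, its parent category, and the whole corpus. The helper `parent_of` (which treats the three-digit category as the parent), the weight values, and the English toy diagnoses are all hypothetical. Noise is simulated in the codes only.

```python
import math
import random
from collections import Counter, defaultdict

def parent_of(code):
    # Assumption: the parent of an ICD-9-CM code is its three-digit category.
    return code.split(".")[0]

def train(docs, codes):
    """Collect word counts per code, per parent category, and for the corpus."""
    leaf, parent, root = defaultdict(Counter), defaultdict(Counter), Counter()
    for text, code in zip(docs, codes):
        words = text.lower().split()
        leaf[code].update(words)
        parent[parent_of(code)].update(words)
        root.update(words)
    return {"leaf": leaf, "parent": parent, "root": root, "prior": Counter(codes)}

def word_prob(counts, word, vocab_size):
    # Laplace-smoothed multinomial estimate of P(word | node).
    return (counts[word] + 1) / (sum(counts.values()) + vocab_size)

def predict(model, text, weights=(0.5, 0.3, 0.2)):
    """Multinomial naive Bayes whose word estimates are shrunk toward ancestors.

    `weights` are fixed here as an assumption; they must sum to 1.
    """
    vocab_size = len(model["root"])
    total = sum(model["prior"].values())
    best_code, best_score = None, -math.inf
    for code, n in model["prior"].items():
        score = math.log(n / total)
        for w in text.lower().split():
            p = (weights[0] * word_prob(model["leaf"][code], w, vocab_size)
                 + weights[1] * word_prob(model["parent"][parent_of(code)], w, vocab_size)
                 + weights[2] * word_prob(model["root"], w, vocab_size))
            score += math.log(p)
        if score > best_score:
            best_code, best_score = code, score
    return best_code

def corrupt_codes(codes, rate, rng):
    """Replace a fraction `rate` of the codes with a different, randomly chosen code."""
    pool = sorted(set(codes))
    noisy = list(codes)
    for i in rng.sample(range(len(codes)), int(rate * len(codes))):
        noisy[i] = rng.choice([c for c in pool if c != codes[i]])
    return noisy

def accuracy(model, docs, codes):
    return sum(predict(model, d) == c for d, c in zip(docs, codes)) / len(docs)

# Invented toy data, for illustration only.
docs = ["diabetes mellitus type 2", "diabetes mellitus type 1",
        "essential hypertension", "pneumonia"]
codes = ["250.00", "250.01", "401.9", "486"]

rng = random.Random(0)
for rate in (0.0, 0.25, 0.5):
    # Train on progressively noisier codes, evaluate against the clean codes.
    noisy_model = train(docs, corrupt_codes(codes, rate, rng))
    print(rate, accuracy(noisy_model, docs, codes))
```

Mixing each code's word estimates with those of its parent category and of the whole corpus lets sparsely populated codes borrow statistical strength from related codes, which is the intuition behind shrinkage. A fuller experiment would also corrupt the diagnosis text and evaluate on held-out data rather than on the training documents.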