This article analyzes the data quality issues that arise when a shrinkage-based classifier is trained on noisy data. A statistical text analysis technique, a shrinkage-based variant of multinomial naive Bayes, is applied to a set of free-text discharge diagnoses recorded for a number of hospitalizations. All of these diagnoses had previously been coded according to the Spanish edition of ICD-9-CM. We assess the predictive power and robustness of this statistical machine learning algorithm for ICD-9-CM classification by training the models on both clean and noisy data. In particular, we investigate the extent to which errors in the free-text diagnoses propagate to the classification model. A measure of predictive accuracy is first computed for the text classification algorithm under analysis. The quality of the sample data is then incrementally degraded by simulating errors in the text, the codes, or both, and the predictive accuracy is recomputed for each noisy sample for comparison. Our results indicate that the shrinkage-based classifier is a valid alternative for automating ICD-9-CM coding, even when the quality of the training data is in question.
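The experimental loop described above (train, measure accuracy, inject noise, retrain, compare) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a two-level code hierarchy in which the parent of a code is the part before the dot, fixed mixing weights `lam` (shrinkage as usually described estimates these weights from data), a character-swap noise model, and invented toy diagnoses with made-up codes. All function and variable names are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def parent(code):
    # Assumed hierarchy: the parent of "A1.0" is the category "A1".
    return code.split(".")[0]


def train(samples, lam=(0.6, 0.3, 0.1)):
    """Collect word counts per code and per parent category.

    lam holds fixed weights for the (leaf, parent, uniform) estimates.
    Fixing them is a simplification made for this sketch.
    """
    leaf, par = defaultdict(Counter), defaultdict(Counter)
    prior, vocab = Counter(), set()
    for text, code in samples:
        words = text.lower().split()
        leaf[code].update(words)
        par[parent(code)].update(words)
        prior[code] += 1
        vocab.update(words)
    return {"leaf": leaf, "par": par, "prior": prior, "vocab": vocab, "lam": lam}


def word_prob(model, word, code):
    # Shrinkage: mix the sparse leaf estimate with its ancestors' estimates.
    l_leaf, l_par, l_uni = model["lam"]
    leaf_counts = model["leaf"][code]
    par_counts = model["par"][parent(code)]
    p_leaf = leaf_counts[word] / max(1, sum(leaf_counts.values()))
    p_par = par_counts[word] / max(1, sum(par_counts.values()))
    p_uni = 1.0 / max(1, len(model["vocab"]))
    return l_leaf * p_leaf + l_par * p_par + l_uni * p_uni


def predict(model, text):
    # Multinomial naive Bayes decision rule in log space.
    total = sum(model["prior"].values())
    best, best_score = None, -math.inf
    for code, n in model["prior"].items():
        score = math.log(n / total)
        for w in text.lower().split():
            score += math.log(word_prob(model, w, code))
        if score > best_score:
            best, best_score = code, score
    return best


def corrupt(samples, rate, rng):
    """Simulated text noise: swap two adjacent characters in a word
    with probability `rate`. Codes are left unchanged."""
    noisy = []
    for text, code in samples:
        words = []
        for w in text.split():
            if len(w) > 1 and rng.random() < rate:
                i = rng.randrange(len(w) - 1)
                w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
            words.append(w)
        noisy.append((" ".join(words), code))
    return noisy


def accuracy(model, samples):
    return sum(predict(model, t) == c for t, c in samples) / len(samples)


# Invented toy data; the codes are placeholders, not real ICD-9-CM codes.
toy = [
    ("heart failure congestive", "A1.0"),
    ("heart failure acute", "A1.1"),
    ("femur fracture closed", "B2.0"),
    ("femur fracture open", "B2.1"),
]

clean_model = train(toy)
for rate in (0.0, 0.2, 0.5):
    noisy_model = train(corrupt(toy, rate, random.Random(0)))
    print(rate, accuracy(noisy_model, toy))
```

The loop at the end retrains on increasingly corrupted text and scores each model against the clean sample, which mirrors the comparison described in the abstract. Simulating errors in the codes rather than the text would need a second corruption function that reassigns labels.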