Comparison of documents classification techniques to classify medical reports

Authors:
F. H. Saad; B. de la Iglesia; G. D. Bell
Affiliations:
School of Computing Sciences, University of East Anglia, Norwich, UK; School of Computing Sciences, University of East Anglia, Norwich, UK; School of Computing Sciences, University of East Anglia, Norwich, UK
Venue:
PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Year:
2006

Abstract

This paper addresses a real-world problem: the classification of text documents in the medical domain. Of the many approaches to classifying text documents, we use a partially supervised classification approach and argue that it is effective and computationally efficient for real-world problems. The approach uses a two-step strategy to reduce the effort required to label documents for classification. Only a small set of positive documents is labeled initially, and others are labeled automatically as a result of the first step. The second step builds the actual text classifier. Several methods have been proposed for each step. We conduct a comprehensive evaluation of various combinations of these methods, comparing their performance on real-world medical documents. The results show that using EM-based methods to build the classifier yields better results than using SVM. We also show experimentally that careful selection of a subset of features to represent the documents can improve the performance of the classifiers.
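
The two-step strategy described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: step 1 uses a naive Bayes classifier to pick "reliable negative" documents from the unlabeled set (one of several possible step-1 methods), step 2 runs EM over a naive Bayes model, and documents are plain token lists. The function names (train_nb, prob_positive, two_step_pu), the 0.5 threshold, and the toy reports are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def train_nb(docs, pos_weights, vocab):
    """Fit a two-class multinomial naive Bayes model from soft labels.

    pos_weights[i] is the probability that docs[i] is positive.
    Laplace smoothing is applied to priors and word likelihoods.
    """
    counts = {0: Counter(), 1: Counter()}
    mass = {0: 0.0, 1: 0.0}
    for doc, w in zip(docs, pos_weights):
        for label, share in ((1, w), (0, 1.0 - w)):
            mass[label] += share
            for token in doc:
                counts[label][token] += share
    total = mass[0] + mass[1]
    model = {}
    for label in (0, 1):
        log_prior = math.log((mass[label] + 1.0) / (total + 2.0))
        denom = sum(counts[label].values()) + len(vocab)
        log_like = {t: math.log((counts[label][t] + 1.0) / denom) for t in vocab}
        model[label] = (log_prior, log_like)
    return model


def prob_positive(doc, model):
    """Posterior probability that doc is positive; unseen tokens are skipped."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[t] for t in doc if t in log_like)
    top = max(scores.values())
    exp = {label: math.exp(s - top) for label, s in scores.items()}
    return exp[1] / (exp[0] + exp[1])


def two_step_pu(positive, unlabeled, em_iters=10, threshold=0.5):
    """Two-step learning from positive and unlabeled token lists (a sketch)."""
    docs = positive + unlabeled
    vocab = {t for d in docs for t in d}
    n_pos = len(positive)

    # Step 1 (assumed method): treat all unlabeled documents as negative,
    # train naive Bayes, and keep the unlabeled documents that the model
    # still scores as negative. These are the "reliable negatives".
    step1 = train_nb(docs, [1.0] * n_pos + [0.0] * len(unlabeled), vocab)
    reliable_neg = [i for i, d in enumerate(unlabeled)
                    if prob_positive(d, step1) < threshold]

    # Step 2 (EM): start from labeled positives and reliable negatives only.
    seed_docs = positive + [unlabeled[i] for i in reliable_neg]
    seed_weights = [1.0] * n_pos + [0.0] * len(reliable_neg)
    model = train_nb(seed_docs, seed_weights, vocab)
    for _ in range(em_iters):
        # E-step: labeled positives stay fixed; unlabeled get soft labels.
        weights = [1.0] * n_pos + [prob_positive(d, model) for d in unlabeled]
        # M-step: refit the model on all documents with the soft labels.
        model = train_nb(docs, weights, vocab)
    return model, reliable_neg


# Toy data, invented purely for illustration.
positive = [["colon", "polyp", "biopsy"],
            ["polyp", "removed", "colon"],
            ["biopsy", "colon", "polyp"]]
unlabeled = [["colon", "polyp", "biopsy"],   # hidden positive
             ["chest", "xray", "clear"],
             ["xray", "chest", "normal"],
             ["normal", "chest", "xray"],
             ["chest", "xray", "clear"]]

model, reliable_neg = two_step_pu(positive, unlabeled)
print("reliable negatives:", reliable_neg)
print("P(positive | colon biopsy):", prob_positive(["colon", "biopsy"], model))
```

In this sketch only the three positive reports carry manual labels; the negatives are inferred in step 1, which is where the labeling effort is saved. Swapping the EM loop for a different learner trained on the positives and reliable negatives would give an alternative second step of the kind the abstract compares against.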