On sample size and classification accuracy: a performance comparison

Authors:
Margarita Sordo; Qing Zeng
Affiliations:
Decision Systems Group, Harvard Medical School, Boston, MA; Decision Systems Group, Harvard Medical School, Boston, MA
Venue:
ISBMDA'05 Proceedings of the 6th International conference on Biological and Medical Data Analysis
Year:
2005

Citing 7

Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners
IEEE Transactions on Pattern Analysis and Machine Intelligence

Expert network: effective and efficient learning from human decisions in text categorization and retrieval
SIGIR '94 Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval

The nature of statistical learning theory

Context-sensitive learning methods for text categorization
SIGIR '96 Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval

Inductive learning algorithms and representations for text categorization
Proceedings of the seventh international conference on Information and knowledge management

Foundations of statistical natural language processing

Text Categorization with Support Vector Machines: Learning with Many Relevant Features
ECML '98 Proceedings of the 10th European Conference on Machine Learning

Cited 5

Automatic extraction of semantic content from medical discharge records
ICOSSE'06 Proceedings of the 5th WSEAS international conference on System science and simulation in engineering

A new monte carlo-based error rate estimator
ANNPR'10 Proceedings of the 4th IAPR TC3 conference on Artificial Neural Networks in Pattern Recognition

A method for determining the number of documents needed for a gold standard corpus
Journal of Biomedical Informatics

Gait verification using knee acceleration signals
Expert Systems with Applications: An International Journal

Improving predictive models of glaucoma severity by incorporating quality indicators
Artificial Intelligence in Medicine

Abstract

We investigate the dependency between sample size and classification accuracy for three classification techniques: Naïve Bayes, Support Vector Machines (SVM) and Decision Trees, over a set of 8500 text excerpts extracted automatically from narrative reports from the Brigham & Women's Hospital, Boston, USA. Each excerpt describes the smoking status of a patient as one of: current smoker, past smoker, never a smoker, or denies smoking. Our empirical results, consistent with [1], confirm that the size of the training set and the classification rate are indeed correlated. Even though these algorithms perform reasonably well with small datasets, as the number of cases increases, both SVM and Decision Trees show a substantial improvement in performance, suggesting a more consistent learning process. Unlike the majority of evaluations, ours were carried out specifically in a medical domain, where limited data is a common occurrence [13][14]. This study is part of the I2B2 project, Core 2.
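The relationship between training-set size and accuracy described in the abstract can be illustrated with a learning curve. The following is a minimal sketch, not the authors' experimental setup: it uses synthetic excerpts rather than the hospital data, a small hand-written multinomial Naïve Bayes with add-one smoothing rather than any specific toolkit, and hypothetical names throughout (CLASS_WORDS, NOISE_WORDS, make_excerpt, learning_curve). SVM and Decision Trees are omitted to keep the example self-contained.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical class-specific vocabulary (illustrative only, not the study's data).
CLASS_WORDS = {
    "current": ["smokes", "daily", "pack", "currently"],
    "past": ["quit", "former", "years", "ago"],
    "never": ["never", "smoked", "nonsmoker", "lifetime"],
    "denies": ["denies", "smoking", "tobacco", "use"],
}
NOISE_WORDS = ["patient", "reports", "history", "the", "of", "a"]


def make_excerpt(label, rng, length=6, noise=0.6):
    """Draw a synthetic excerpt; each token is a noise word with probability `noise`."""
    return [
        rng.choice(NOISE_WORDS) if rng.random() < noise else rng.choice(CLASS_WORDS[label])
        for _ in range(length)
    ]


def make_dataset(n, rng):
    """Return n (tokens, label) pairs with labels drawn uniformly at random."""
    labels = list(CLASS_WORDS)
    data = []
    for _ in range(n):
        label = rng.choice(labels)
        data.append((make_excerpt(label, rng), label))
    return data


def train_nb(data):
    """Collect the class and per-class word counts used by multinomial Naive Bayes."""
    class_counts = Counter(label for _, label in data)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in data:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest smoothed log-posterior for the tokens."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        # Add-one smoothing, with one extra slot reserved for unseen tokens.
        denom = sum(word_counts[label].values()) + len(vocab) + 1
        score = math.log(count / total)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def learning_curve(sizes, n_test=500, seed=0):
    """Train on growing prefixes of one pool and score each model on a fixed test set."""
    rng = random.Random(seed)
    test = make_dataset(n_test, rng)
    pool = make_dataset(max(sizes), rng)
    curve = []
    for n in sizes:
        model = train_nb(pool[:n])
        correct = sum(predict_nb(model, toks) == lab for toks, lab in test)
        curve.append((n, correct / len(test)))
    return curve


if __name__ == "__main__":
    for n, acc in learning_curve([10, 50, 200, 1000]):
        print(f"train size {n:5d}  accuracy {acc:.3f}")
```

Holding the test set fixed while growing the training prefix isolates the effect of sample size; a fuller comparison would repeat this over several random splits and over each classifier under study.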