Effect of small sample size on text categorization with support vector machines

Authors:
Pawel Matykiewicz;John Pestian
Affiliations:
Biomedical Informatics, Cincinnati Children's Hospital, Cincinnati, OH;Biomedical Informatics, Cincinnati Children's Hospital, Cincinnati, OH
Venue:
BioNLP '12 Proceedings of the 2012 Workshop on Biomedical Natural Language Processing
Year:
2012

Abstract

Datasets that answer difficult clinical questions are expensive in part due to the need for medical expertise and patient informed consent. We investigate the effect of small sample size on the performance of a text categorization algorithm. We show how to determine whether the dataset is large enough to train support vector machines. Since it is not possible to cover all aspects of sample size calculation in one manuscript, we focus on how certain types of data relate to certain properties of support vector machines. We show that normal vectors of decision hyperplanes can be used for assessing reliability and internal cross-validation can be used for assessing stability of small sample data.