Effect of small sample size on text categorization with support vector machines

  • Authors:
  • Pawel Matykiewicz;John Pestian

  • Affiliations:
  • Biomedical Informatics, Cincinnati Children's Hospital, Cincinnati, OH;Biomedical Informatics, Cincinnati Children's Hospital, Cincinnati, OH

  • Venue:
  • BioNLP '12 Proceedings of the 2012 Workshop on Biomedical Natural Language Processing
  • Year:
  • 2012

Abstract

Datasets that answer difficult clinical questions are expensive in part due to the need for medical expertise and patient informed consent. We investigate the effect of small sample size on the performance of a text categorization algorithm. We show how to determine whether the dataset is large enough to train support vector machines. Since it is not possible to cover all aspects of sample size calculation in one manuscript, we focus on how certain types of data relate to certain properties of support vector machines. We show that normal vectors of decision hyperplanes can be used for assessing reliability and internal cross-validation can be used for assessing stability of small sample data.
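
The abstract's idea of using normal vectors of decision hyperplanes as a reliability signal can be illustrated with a small sketch. This is not the authors' procedure, which the abstract does not specify. It assumes one plausible reading: train a linear support vector machine on several bootstrap resamples of a small dataset and measure how closely the resulting normal vectors agree, using mean pairwise cosine similarity. The function names, the Pegasos-style trainer, the toy data, and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style subgradient descent for a bias-free linear SVM.

    Returns the normal vector w of the decision hyperplane w . x = 0.
    (Illustrative trainer; an assumption, not the solver used in the paper.)
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Regularization shrink: (1 - eta * lam) equals (1 - 1/t).
            w = [(1.0 - 1.0 / t) * wj for wj in w]
            if margin < 1:  # hinge loss is active for this example
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def normal_vector_agreement(X, y, n_resamples=10, seed=0):
    """Mean pairwise cosine similarity of normal vectors across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    normals = []
    for r in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        normals.append(
            train_linear_svm([X[i] for i in idx], [y[i] for i in idx], seed=r)
        )
    sims = [
        cosine(normals[a], normals[b])
        for a in range(len(normals))
        for b in range(a + 1, len(normals))
    ]
    return sum(sims) / len(sims)


# Hypothetical toy data: two classes separated along the first feature.
X = [[2.0, 0.5], [1.5, -0.3], [2.5, 0.1], [1.8, -0.6],
     [-2.0, 0.4], [-1.7, -0.2], [-2.3, 0.3], [-1.9, -0.5]]
y = [1, 1, 1, 1, -1, -1, -1, -1]

agreement = normal_vector_agreement(X, y)
print(f"mean pairwise cosine similarity of normal vectors: {agreement:.3f}")
```

Under this reading, an agreement score near 1 suggests that the hyperplane orientation barely changes when the sample is perturbed, while a low score suggests the dataset may be too small to pin down the decision boundary. The same resampling loop could be adapted to compare cross-validation scores across resamples as a rough stability check.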