Safely selecting subsets of training data

  • Authors:
  • Dawei Yin;Chang An;Henry S. Baird

  • Affiliations:
  • Lehigh University, Bethlehem, PA;Lehigh University, Bethlehem, PA;Lehigh University, Bethlehem, PA

  • Venue:
  • DAS '10 Proceedings of the 9th IAPR International Workshop on Document Analysis Systems
  • Year:
  • 2010

Quantified Score

Hi-index 0.00

Visualization

Abstract

Highly versatile classifiers for document analysis systems demand representative training sets which can be dauntingly large, often challenging conventional trainable classifier technologies. We propose to select a small subset of training data, matched to each particular test set, in hopes of improved speed without loss of accuracy. Since selection must occur on line, we cannot use classifiers that require off-line training. Fortunately, Nearest Neighbors classifiers support on-line training; we use a fast approximate kNN technology using hashed k-D trees. The distribution of samples in k-D bins can be used to measure similarity between any two document images: we select the three most similar training images for any given test image. In experiments on a document image content extraction system, our algorithm was able to prune 118 training images to three, for a speedup of a factor of 17 with no loss of accuracy. Other experiments with an oracle and manual selection suggest that it may be possible to improve accuracy as well.