Towards versatile document analysis systems
DAS'06 Proceedings of the 7th International Conference on Document Analysis Systems
Highly versatile classifiers for document analysis systems demand representative training sets that can be dauntingly large, often overwhelming conventional trainable-classifier technologies. We propose selecting a small subset of the training data, matched to each particular test set, in hopes of improving speed without losing accuracy. Since selection must occur online, we cannot use classifiers that require offline training. Fortunately, nearest-neighbor classifiers support online training; we use a fast approximate kNN technique based on hashed k-D trees. The distribution of samples across k-D bins can be used to measure the similarity between any two document images: we select the three training images most similar to any given test image. In experiments on a document image content extraction system, our algorithm pruned 118 training images down to three, for a speedup by a factor of 17 with no loss of accuracy. Other experiments with an oracle and with manual selection suggest that it may be possible to improve accuracy as well.
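The selection step described above can be sketched in a few lines: quantize each image's feature samples into k-D bins, compare the resulting bin histograms, and keep the most similar training images. The bin width, the use of histogram intersection as the similarity measure, and all function names below are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

def bin_key(vec, bin_width=0.25):
    # Quantize each feature coordinate onto a k-D grid cell.
    # (bin_width is an assumed parameter, not taken from the paper.)
    return tuple(int(x // bin_width) for x in vec)

def bin_histogram(samples, bin_width=0.25):
    # Distribution of an image's feature samples over k-D bins.
    return Counter(bin_key(v, bin_width) for v in samples)

def similarity(hist_a, hist_b):
    # Histogram intersection over shared bins, normalized by the smaller
    # total mass -- one plausible similarity measure between two images.
    overlap = sum(min(hist_a[k], hist_b[k]) for k in hist_a.keys() & hist_b.keys())
    total = min(sum(hist_a.values()), sum(hist_b.values()))
    return overlap / total if total else 0.0

def select_training_images(test_samples, training_sets, k=3):
    # Rank training images by bin-distribution similarity to the test
    # image and keep the k most similar (k = 3 in the paper's experiments).
    test_hist = bin_histogram(test_samples)
    ranked = sorted(training_sets.items(),
                    key=lambda kv: similarity(test_hist, bin_histogram(kv[1])),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

The selected subset would then be loaded into the online kNN classifier in place of the full training set, which is where the reported factor-of-17 speedup comes from.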