Efficient model selection for large-scale nearest-neighbor data mining

  • Authors:
  • Greg Hamerly; Greg Speegle

  • Affiliations:
  • Baylor University, Waco, TX; Baylor University, Waco, TX

  • Venue:
  • BNCOD'10: Proceedings of the 27th British National Conference on Databases (Data Security and Security Data)
  • Year:
  • 2010

Abstract

One of the most widely used models for large-scale data mining is the k-nearest neighbor (k-nn) algorithm. It can be used for classification, regression, density estimation, and information retrieval. To use k-nn, a practitioner must first choose k, usually by selecting the k with the minimal loss as estimated by cross-validation. In this work, we begin with an existing but little-studied method that greatly accelerates the cross-validation process for selecting k from a range of user-provided candidates, so that a much larger range of k values can be examined quickly. Next, we extend this algorithm with an additional optimization that improves performance on locally linear regression problems. We also show how the method can automatically select the range of k values when the user has no a priori knowledge of appropriate bounds. Furthermore, we apply statistical methods to reduce the number of examples examined while still finding a likely best k, greatly improving performance on large data sets. Finally, we present both analytical and experimental results that demonstrate these benefits.
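
The abstract refers to the standard practice of choosing k by cross-validation, and to an accelerated variant that shares work across all candidate k values. The sketch below illustrates that shared-work idea for leave-one-out k-nn classification: distances are computed and neighbors sorted once, then every candidate k is scored from the same sorted neighbor lists rather than rerunning k-nn per k. This is a minimal illustration under assumed details (brute-force distances, majority voting, nonnegative integer labels), not the paper's algorithm; the function name best_k_loocv and the toy data are hypothetical.

```python
import numpy as np

def best_k_loocv(X, y, k_values):
    """Leave-one-out cross-validation for k-nn classification.

    Distances and neighbor orderings are computed once, so each
    candidate k is evaluated from the same sorted neighbor lists.
    Labels y are assumed to be nonnegative integers.
    """
    # Pairwise squared Euclidean distances, computed once: O(n^2 d).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)       # exclude each point from its own neighbor list
    order = np.argsort(d2, axis=1)     # neighbor indices, nearest first
    neighbor_labels = y[order]         # labels of each point's neighbors

    best_k, best_err = None, np.inf
    for k in k_values:
        # Majority vote among the k nearest neighbors of each held-out point.
        votes = neighbor_labels[:, :k]
        pred = np.array([np.bincount(v).argmax() for v in votes])
        err = np.mean(pred != y)       # leave-one-out error estimate for this k
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err

# Example usage on toy data:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(best_k_loocv(X, y, k_values=range(1, 26)))
```

Sorting neighbors once and then sweeping k is what makes evaluating many candidate values cheap relative to rerunning k-nn for each k, which is consistent with the style of acceleration the abstract describes; the paper's further optimizations (locally linear regression, automatic range selection, statistical subsampling) are not shown here.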