High Performance Data Mining Using the Nearest Neighbor Join

Authors:
Christian Böhm;Florian Krebs
Venue:
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Year:
2002

Citing 0
Cited 10

Cited by:

Multi-Way Distance Join Queries in Spatial Databases
    Geoinformatica

SIREN: a similarity retrieval engine for complex data
    VLDB '06 Proceedings of the 32nd international conference on Very large data bases

Efficient index-based KNN join processing for high-dimensional data
    Information and Software Technology

A performance comparison of distance-based query algorithms using R-trees in spatial databases
    Information Sciences: an International Journal

On efficient spatial matching
    VLDB '07 Proceedings of the 33rd international conference on Very large data bases

New Neighborhood Based Classification Rules for Metric Spaces and Their Use in Ensemble Classification
    IbPRIA '07 Proceedings of the 3rd Iberian conference on Pattern Recognition and Image Analysis, Part I

Customer's Relationship Segmentation Driving the Predictive Modeling for Bad Debt Events
    UMAP '09 Proceedings of the 17th International Conference on User Modeling, Adaptation, and Personalization: formerly UM and AH

Design and evaluation of trajectory join algorithms
    Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems

Combining elimination rules in tree-based nearest neighbor search algorithms
    SSPR&SPR'10 Proceedings of the 2010 joint IAPR international conference on Structural, syntactic, and statistical pattern recognition

A disk-aware algorithm for time series motif discovery
    Data Mining and Knowledge Discovery

Abstract

The similarity join has become an important database primitive to support similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Well-known are two types of the similarity join, the distance range join where the user defines a distance threshold for the join, and the closest point query or k-distance join which retrieves the k most similar pairs. In this paper, we investigate an important, third similarity join operation called k-nearest neighbor join which combines each point of one point set with its k nearest neighbors in the other set. It has been shown that many standard algorithms of Knowledge Discovery in Databases (KDD) such as k-means and k-medoid clustering, nearest neighbor classification, data cleansing, postprocessing of sampling-based data mining etc. can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of the result of these algorithms. We propose a new algorithm to compute the k-nearest neighbor join using the multipage index (MuX), a specialized index structure for the similarity join. To reduce both CPU and I/O cost, we develop optimal loading and processing strategies.
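
The three join operations named in the abstract differ only in which pairs they return, which a small brute-force sketch can make concrete. This is not the MuX-based algorithm proposed in the paper: it is a quadratic-time illustration that assumes Euclidean distance and small in-memory point lists. The function names (`distance_range_join`, `k_distance_join`, `knn_join`) and the sample points are hypothetical and chosen only for this example.

```python
import heapq
import math


def distance_range_join(r, s, eps):
    """All index pairs (i, j) whose points lie within distance eps."""
    return [(i, j)
            for i, p in enumerate(r)
            for j, q in enumerate(s)
            if math.dist(p, q) <= eps]


def k_distance_join(r, s, k):
    """The k closest pairs overall, taken across the whole cross product."""
    pairs = ((math.dist(p, q), i, j)
             for i, p in enumerate(r)
             for j, q in enumerate(s))
    return [(i, j) for _, i, j in heapq.nsmallest(k, pairs)]


def knn_join(r, s, k):
    """For each point of r, the indices of its k nearest neighbors in s."""
    result = {}
    for i, p in enumerate(r):
        # Ties in distance are broken by the smaller index in s.
        nearest = heapq.nsmallest(
            k, ((math.dist(p, q), j) for j, q in enumerate(s)))
        result[i] = [j for _, j in nearest]
    return result


# Hypothetical sample data: two query points and four candidate points.
R = [(0, 0), (10, 10)]
S = [(1, 0), (0, 2), (9, 10), (20, 20)]

range_pairs = distance_range_join(R, S, 2.0)
closest_pairs = k_distance_join(R, S, 2)
neighbors = knn_join(R, S, 2)
```

Unlike the other two joins, the k-nn join returns exactly k partners for every point of the first set, regardless of how far away those partners are. The result size is therefore known in advance, and no distance threshold has to be chosen.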