High Performance Data Mining Using the Nearest Neighbor Join

  • Authors:
  • Christian Böhm;Florian Krebs

  • Venue:
  • ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
  • Year:
  • 2002

Abstract

The similarity join has become an important database primitive to support similarity search and data mining. A similarity join combines two sets of complex objects such that the result contains all pairs of similar objects. Well-known are two types of the similarity join, the distance range join where the user defines a distance threshold for the join, and the closest point query or k-distance join which retrieves the k most similar pairs. In this paper, we investigate an important, third similarity join operation called k-nearest neighbor join which combines each point of one point set with its k nearest neighbors in the other set. It has been shown that many standard algorithms of Knowledge Discovery in Databases (KDD) such as k-means and k-medoid clustering, nearest neighbor classification, data cleansing, postprocessing of sampling-based data mining etc. can be implemented on top of the k-nn join operation to achieve performance improvements without affecting the quality of the result of these algorithms. We propose a new algorithm to compute the k-nearest neighbor join using the multipage index (MuX), a specialized index structure for the similarity join. To reduce both CPU and I/O cost, we develop optimal loading and processing strategies.
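
To make the difference between the three join types concrete, the following is a minimal brute-force sketch in Python. It is not the MuX-based algorithm proposed in the paper; it only illustrates what each operation returns. The function names, the use of Euclidean distance, and the representation of points as tuples are assumptions made for illustration.

```python
import heapq
import math
from itertools import product


def dist(p, q):
    # Euclidean distance; the choice of metric is an assumption of this sketch.
    return math.dist(p, q)


def distance_range_join(R, S, eps):
    # All pairs (r, s) whose distance does not exceed the user-defined threshold.
    return [(r, s) for r, s in product(R, S) if dist(r, s) <= eps]


def k_distance_join(R, S, k):
    # The k closest pairs overall, taken across the whole cross product.
    return heapq.nsmallest(k, product(R, S), key=lambda pair: dist(*pair))


def knn_join(R, S, k):
    # For every point r in R, its k nearest neighbors in S.
    return {r: heapq.nsmallest(k, S, key=lambda s: dist(r, s)) for r in R}


R = [(0.0, 0.0), (10.0, 10.0)]
S = [(0.0, 1.0), (0.0, 2.0), (9.0, 10.0), (20.0, 20.0)]

range_pairs = distance_range_join(R, S, eps=1.5)
closest_pairs = k_distance_join(R, S, k=2)
neighbors = knn_join(R, S, k=2)
```

In this sketch the result size of the distance range join depends on the threshold, the k-distance join returns k pairs in total, and the k-nn join returns k neighbors for each point of R (as long as S holds at least k points). The brute-force version scans all |R| x |S| pairs, which is the cost an index-based approach aims to avoid.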