An empirical study on selective partitioning dimensions for partition-based similarity joins

  • Authors:
  • Hyoseop Shin

  • Affiliations:
  • Department of Internet and Multimedia Engineering, Konkuk University, Seoul, Republic of Korea

  • Venue:
  • Data & Knowledge Engineering
  • Year:
  • 2007

Abstract

Real-world application data are usually distributed sparsely and non-uniformly in very large high-dimensional spaces. Hence, the selection of effective partitioning dimensions is crucial for partition-based similarity joins. In this paper, we present and evaluate two data partitioning algorithms. PerDimSelect selects a subset of the original perpendicular dimension axes and maps each data point into the reduced-dimensional space. DiaDimSelect creates a single one-dimensional axis by combining some of the original perpendicular dimensions and maps each data point onto the newly created axis. In the experiments, several measures are used to compare the performance of the algorithms, including CPU cost, total response time, and the number of buckets created. In conclusion, DiaDimSelect performs better than PerDimSelect because it creates far fewer partition buckets as the number of partitioning dimensions increases, which keeps the I/O cost low while considerably reducing the CPU cost.
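
The difference between the two mappings can be illustrated with a minimal sketch. The abstract does not give the actual algorithms, so everything below is an assumption made for illustration: the function names `perdim_bucket`, `diadim_bucket`, and `count_buckets` are hypothetical, the data are assumed to lie in the unit hypercube, the join threshold `eps` is assumed to be a Euclidean distance, and the combined axis is assumed to be the diagonal of the selected dimensions. Under those assumptions, a grid over k selected axes can produce up to ceil(1/eps)^k cells, whereas stripes of width eps along the diagonal produce at most ceil(sqrt(k)/eps) buckets.

```python
import math
import random


def perdim_bucket(point, dims, eps):
    """Hypothetical PerDimSelect-style mapping: keep only the selected
    perpendicular axes and return the grid cell (cell width eps) on them."""
    return tuple(math.floor(point[d] / eps) for d in dims)


def diadim_bucket(point, dims, eps):
    """Hypothetical DiaDimSelect-style mapping: project the point onto the
    unit diagonal of the selected axes and return the stripe (width eps).

    Two points within Euclidean distance eps have projections that differ
    by at most eps, so join partners lie in the same or an adjacent stripe.
    """
    projection = sum(point[d] for d in dims) / math.sqrt(len(dims))
    return math.floor(projection / eps)


def count_buckets(points, bucket_fn, dims, eps):
    """Number of distinct non-empty buckets produced by a mapping."""
    return len({bucket_fn(p, dims, eps) for p in points})


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    points = [tuple(rng.random() for _ in range(8)) for _ in range(1000)]
    eps = 0.25
    for k in (1, 2, 3, 4):
        dims = list(range(k))  # assumption: simply take the first k axes
        grid = count_buckets(points, perdim_bucket, dims, eps)
        diag = count_buckets(points, diadim_bucket, dims, eps)
        print(f"k={k}: grid buckets={grid}, diagonal buckets={diag}")
```

Running the script prints the non-empty bucket counts of both mappings for one to four partitioning dimensions on uniform random data. How the partitioning dimensions are actually chosen is not specified in the abstract, so the sketch simply takes the first k axes.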