An approximate algorithm for top-k closest pairs join query in large high dimensional data

Authors:
Fabrizio Angiulli;Clara Pizzuti
Affiliations:
ICAR-CNR Istituto di Calcolo e Reti ad Alte Prestazioni, Consiglio Nazionale delle Ricerche 87030 Rende, CS, Italy;ICAR-CNR Istituto di Calcolo e Reti ad Alte Prestazioni, Consiglio Nazionale delle Ricerche 87030 Rende, CS, Italy
Venue:
Data & Knowledge Engineering
Year:
2005



Abstract

In this paper we present a novel approximate algorithm to calculate the top-k closest pairs join query of two large, high-dimensional data sets. The algorithm has worst-case time complexity $O(d^2 nk)$ and space complexity $O(nd)$, and guarantees a solution within an $O(d^{1+1/t})$ factor of the exact one, where $t \in \{1, 2, \ldots, \infty\}$ denotes the Minkowski metric $L_t$ of interest and $d$ the dimensionality. It uses a space-filling curve to establish an order between the points of the space and performs at most $d + 1$ sorts and scans of the two data sets. During a scan, each point from one data set is compared with its closest points, according to the space-filling curve order, in the other data set, and points whose contribution to the solution has already been analyzed are detected and eliminated. Experimental results on real and synthetic data sets show that our algorithm behaves as an exact algorithm in low-dimensional spaces; it is able to prune the entire data set, or a considerable fraction of it, even for high dimensions if certain separation conditions are satisfied; in any case it returns a solution within a small error of the exact one.
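
The sort-and-scan idea described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a Z-order (Morton) key as a stand-in for whichever space-filling curve the paper uses, assumes coordinates in [0, 1), uses d + 1 evenly spaced shifts of the data as an illustrative choice, and omits the pruning of already-analyzed points. The names `minkowski`, `morton_key`, `approx_top_k_closest_pairs`, and the `window` and `bits` parameters are hypothetical.

```python
import bisect
import heapq


def minkowski(p, q, t):
    """L_t distance between two points; t = float('inf') gives the max metric."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if t == float("inf"):
        return max(diffs)
    return sum(x ** t for x in diffs) ** (1.0 / t)


def morton_key(point, bits=10):
    """Z-order key: interleave the bits of coordinates quantised from [0, 1)."""
    top = (1 << bits) - 1
    cells = [min(int(c * (1 << bits)), top) for c in point]
    key = 0
    for b in range(bits - 1, -1, -1):      # most significant bit first
        for c in cells:
            key = (key << 1) | ((c >> b) & 1)
    return key


def approx_top_k_closest_pairs(A, B, k, t=2, window=2, bits=10):
    """Approximate top-k closest pairs between A and B (illustrative sketch).

    For each of d + 1 shifted copies of the data, B is sorted by curve key and
    every point of A is compared with about `window` curve-neighbours on each
    side of its position in B.
    """
    d = len(A[0])
    candidates = {}                        # (i, j) -> distance, deduplicated
    for s in range(d + 1):
        shift = s / (d + 1)                # assumed shift schedule

        def key(p):
            # Shift, then rescale so coordinates stay inside [0, 1).
            return morton_key([(c + shift) / 2.0 for c in p], bits)

        order_b = sorted((key(q), j) for j, q in enumerate(B))
        keys_b = [kb for kb, _ in order_b]
        for i, p in enumerate(A):
            pos = bisect.bisect_left(keys_b, key(p))
            lo, hi = max(0, pos - window), min(len(order_b), pos + window)
            for _, j in order_b[lo:hi]:
                if (i, j) not in candidates:
                    candidates[(i, j)] = minkowski(p, B[j], t)
    # Keep the k candidate pairs with the smallest distances.
    return heapq.nsmallest(k, ((dist, i, j) for (i, j), dist in candidates.items()))


A = [(0.1, 0.1), (0.9, 0.9)]
B = [(0.12, 0.1), (0.5, 0.5), (0.88, 0.9)]
result = approx_top_k_closest_pairs(A, B, k=2, window=3)
```

With `window` at least as large as the second data set, every pair is examined and the sketch degenerates to an exact brute-force join; smaller windows trade accuracy for fewer distance computations, which is the regime the abstract's approximation guarantee addresses.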