Efficient Bulk Loading of Large High-Dimensional Indexes

Authors:
Christian Böhm;Hans-Peter Kriegel
Affiliations:
-;-
Venue:
DaWaK '99 Proceedings of the First International Conference on Data Warehousing and Knowledge Discovery
Year:
1999

Citing 8
Cited 3

Algorithms for clustering data
A cost model for nearest neighbor search in high-dimensional data space

PODS '97 Proceedings of the sixteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
Improving the Query Performance of High-Dimensional Index Structures by Bulk-Load Operations

EDBT '98 Proceedings of the 6th International Conference on Extending Database Technology: Advances in Database Technology
A Generic Approach to Bulk Loading Multidimensional Index Structures

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces

VLDB '98 Proceedings of the 24th International Conference on Very Large Data Bases
Efficient and Effective Clustering Methods for Spatial Data Mining

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Hilbert R-tree: An Improved R-tree using Fractals

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
The X-tree: An Index Structure for High-Dimensional Data

VLDB '96 Proceedings of the 22nd International Conference on Very Large Data Bases

Bulk Operations for Space-Partitioning Trees

ICDE '04 Proceedings of the 20th International Conference on Data Engineering
Bulk loading a linear hash file

DaWaK'06 Proceedings of the 8th international conference on Data Warehousing and Knowledge Discovery
Research and implement of real-time data loading system IMIL

WISE'06 Proceedings of the 7th international conference on Web Information Systems

Abstract

Efficient index construction in multidimensional data spaces is important for many knowledge discovery algorithms, because construction time must typically be amortized by performance gains in query processing. In this paper, we propose a generic bulk loading method that allows user-defined split strategies to be applied during index construction, so that the properties of the index can be adapted to the requirements of a specific knowledge discovery algorithm. Because large data sets do not fit in main memory, our algorithm is based on external sorting. The split strategy can base its decisions on a sample of the data set, which is selected automatically. The sort algorithm is a variant of the well-known Quicksort algorithm, enhanced to work on secondary storage. The index construction has a runtime complexity of O(n log n). We show both analytically and experimentally that the algorithm outperforms traditional index construction methods by large factors.
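
The abstract gives no pseudocode, so the following is only a minimal in-memory sketch of the general idea it describes: a Quicksort-style recursive partitioning in which a pluggable split strategy picks each split from a random sample. It is not the authors' algorithm. The names `bulk_load`, `partition`, `max_extent_median_split`, `capacity`, and `sample_size` are assumptions made for illustration, and a Python list stands in for the data file on secondary storage.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, ...]
# A split strategy inspects a sample and returns (split dimension, pivot value).
SplitStrategy = Callable[[Sequence[Point]], Tuple[int, float]]


def max_extent_median_split(sample: Sequence[Point]) -> Tuple[int, float]:
    """Hypothetical strategy: split the widest dimension at the sample median."""
    dims = range(len(sample[0]))
    dim = max(dims, key=lambda d: max(p[d] for p in sample) - min(p[d] for p in sample))
    values = sorted(p[dim] for p in sample)
    return dim, values[len(values) // 2]


def partition(points: List[Point], lo: int, hi: int, dim: int, pivot: float) -> int:
    """Quicksort-style in-place partition of points[lo:hi].

    Afterwards, points with coordinate < pivot precede all others.
    Returns the index of the first point on the right-hand side.
    """
    i, j = lo, hi - 1
    while i <= j:
        if points[i][dim] < pivot:
            i += 1
        else:
            points[i], points[j] = points[j], points[i]
            j -= 1
    return i


def bulk_load(
    points: List[Point],
    lo: int,
    hi: int,
    capacity: int,
    strategy: SplitStrategy = max_extent_median_split,
    sample_size: int = 32,
    rng: Optional[random.Random] = None,
) -> List[Tuple[int, int]]:
    """Reorder points[lo:hi] in place; return leaf pages as (start, end) ranges."""
    rng = rng or random.Random(0)
    n = hi - lo
    if n <= capacity:
        return [(lo, hi)]
    # Decide the split from a small random sample instead of the full range.
    sample = [points[k] for k in rng.sample(range(lo, hi), min(sample_size, n))]
    dim, pivot = strategy(sample)
    mid = partition(points, lo, hi, dim, pivot)
    if mid == lo or mid == hi:
        # Degenerate split (e.g. many duplicates): sort on dim and halve instead.
        points[lo:hi] = sorted(points[lo:hi], key=lambda p: p[dim])
        mid = lo + n // 2
    left = bulk_load(points, lo, mid, capacity, strategy, sample_size, rng)
    right = bulk_load(points, mid, hi, capacity, strategy, sample_size, rng)
    return left + right


if __name__ == "__main__":
    gen = random.Random(42)
    data = [tuple(gen.random() for _ in range(8)) for _ in range(1000)]
    pages = bulk_load(data, 0, len(data), capacity=16)
    print(len(pages), "leaf pages, largest holds", max(e - s for s, e in pages), "points")
```

Passing a different `strategy` function changes how the space is divided without touching the partitioning code, which is the kind of adaptability the abstract attributes to user-defined split strategies. An external-memory variant would replace the in-place list operations with block-wise reads and writes; that part is not sketched here.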