Design and analysis of a multi-dimensional data sampling service for large scale data analysis applications

  • Authors:
  • Xi Zhang;Tahsin Kurc;Joel Saltz;Srinivasan Parthasarathy

  • Affiliations:
  • Department of Biomedical Informatics and Department of Computer Science and Engineering, The Ohio State University, Columbus, OH;Department of Biomedical Informatics, The Ohio State University, Columbus, OH;Department of Biomedical Informatics and Department of Computer Science and Engineering, The Ohio State University, Columbus, OH;Department of Biomedical Informatics and Department of Computer Science and Engineering, The Ohio State University, Columbus, OH

  • Venue:
  • IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
  • Year:
  • 2006

Abstract

Sampling is widely used to improve efficiency in database and data mining applications that operate on large datasets. In this paper, we present a scalable sampling implementation that supports efficient multi-dimensional, spatio-temporal sample generation on dynamic, large-scale datasets stored on a storage cluster. The proposed algorithm uses Hilbert space-filling curves to impose an approximate linear order on multi-dimensional data while preserving spatial locality. This new implementation is built on top of our previous implementation, which efficiently samples large datasets along a single dimension (e.g., time), thereby realizing a service for spatio-temporal sampling. We evaluate the performance of our approach by comparing it with the popular R-tree-based technique. The experimental results show that our approach achieves up to an order of magnitude higher efficiency and scalability.
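To make the idea of linearizing multi-dimensional data concrete, here is a minimal Python sketch. It is not the authors' implementation. It uses the standard textbook mapping from a 2-D grid cell to its position on a Hilbert curve, then draws a random sample from a contiguous range of Hilbert keys, standing in for a one-dimensional sampler. The function names (`hilbert_index`, `build_linear_order`, `sample_key_range`), the 2-D integer grid, and the in-memory sorted list are all assumptions made for illustration.

```python
import bisect
import random


def hilbert_index(order, x, y):
    """Map cell (x, y) of a 2**order by 2**order grid to its Hilbert curve position."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect so the remaining low-order bits are read
        # in the orientation of the sub-curve for this quadrant.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def build_linear_order(points, order):
    """Return (key, point) pairs sorted by Hilbert key (hypothetical helper)."""
    return sorted((hilbert_index(order, x, y), (x, y)) for x, y in points)


def sample_key_range(keyed, lo, hi, k, seed=0):
    """Sample up to k points whose Hilbert key lies in [lo, hi].

    Stands in for a one-dimensional sampler applied to the linearized
    data; the paper's actual sampler is not described here.
    """
    keys = [key for key, _ in keyed]
    start = bisect.bisect_left(keys, lo)
    stop = bisect.bisect_right(keys, hi)
    candidates = keyed[start:stop]
    rng = random.Random(seed)
    return [p for _, p in rng.sample(candidates, min(k, len(candidates)))]


# Usage: linearize an 8 x 8 grid and sample 5 points from keys 0..15.
grid = [(x, y) for x in range(8) for y in range(8)]
keyed = build_linear_order(grid, order=3)
picked = sample_key_range(keyed, lo=0, hi=15, k=5)
```

In this sketch, keys 0 through 15 on the order-3 curve cover exactly one 4 x 4 quadrant, so the sampled points are spatially close. In general, a rectangular query region corresponds to several disjoint key intervals rather than one, so a full service would need to decompose the region into intervals before sampling.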