Supporting User-Defined Subsetting and Aggregation over Parallel NetCDF Datasets

  • Authors:
  • Yu Su; Gagan Agrawal

  • Venue:
  • CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
  • Year:
  • 2012

Abstract

While dissemination of scientific data is becoming crucial for facilitating scientific discoveries, a key challenge facing these efforts is that dataset sizes continue to grow rapidly. Because wide-area data transfer bandwidths and disk retrieval speeds are growing at a much slower pace, it is becoming extremely hard for scientists to download, manage, and process scientific datasets. We have developed a lightweight data management tool that allows server-side subsetting and aggregation on scientific datasets stored in their native format. While our approach is more general, this paper describes an implementation specific to NetCDF, one of the most popular scientific data formats. To support a variety of queries efficiently, our tool generates code for pre-filtering and post-filtering, and parallelizes selection and aggregation queries using novel algorithms. We have extensively evaluated our implementation and compared its performance and functionality against OPeNDAP. We demonstrate that even for subsetting queries that are directly supported in OPeNDAP, the sequential performance of our system is better. In addition, our system is capable of supporting a larger variety of queries, scaling performance by parallelizing the queries, and reducing wide-area data transfers through server-side data aggregation.
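
The abstract's three ideas (pre-filtering, post-filtering, and parallel aggregation) can be illustrated with a small sketch. It is not the authors' tool or algorithm. It assumes a toy in-memory grid standing in for a NetCDF variable, and every name in it (`make_grid`, `prefilter`, `partial_sum`, `query_mean`) is hypothetical. Pre-filtering here means turning a range predicate on a dimension's coordinates into indices, so that only matching cells are touched. Post-filtering means applying a value predicate to the cells that were read. Aggregation is computed as per-partition partial results that are merged, so only a single number would need to leave the server.

```python
from concurrent.futures import ThreadPoolExecutor


def make_grid(n_time, n_lat):
    """Build a toy variable temp[time][lat]; a stand-in for a NetCDF variable."""
    return [[float(t * n_lat + j) for j in range(n_lat)] for t in range(n_time)]


def prefilter(coords, lo, hi):
    """Pre-filter: map a coordinate range to the indices that satisfy it."""
    return [i for i, c in enumerate(coords) if lo <= c <= hi]


def partial_sum(data, time_idx, lat_idx, min_value):
    """Visit only pre-filtered cells, post-filter by value, return (sum, count)."""
    total, count = 0.0, 0
    for t in time_idx:
        row = data[t]
        for j in lat_idx:
            value = row[j]
            if value >= min_value:  # post-filter on the data value itself
                total += value
                count += 1
    return total, count


def query_mean(data, times, lats, time_range, lat_range, min_value, workers=2):
    """Mean of the selected cells, computed from per-partition partial sums."""
    time_idx = prefilter(times, *time_range)
    lat_idx = prefilter(lats, *lat_range)
    # Split the selected time steps round-robin across workers (an assumption;
    # a real system would partition according to the on-disk layout).
    chunks = [time_idx[k::workers] for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(
            pool.map(lambda c: partial_sum(data, c, lat_idx, min_value), chunks)
        )
    total = sum(p[0] for p in parts)
    count = sum(p[1] for p in parts)
    mean = total / count if count else None
    return mean, count


if __name__ == "__main__":
    grid = make_grid(4, 5)
    times = [0, 1, 2, 3]
    lats = [10.0, 20.0, 30.0, 40.0, 50.0]
    print(query_mean(grid, times, lats, (1, 2), (20.0, 40.0), 7.0))
```

In this sketch the client would receive one mean and one count instead of the selected cells, which is the kind of reduction in wide-area transfer that server-side aggregation aims for.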