Supporting User-Defined Subsetting and Aggregation over Parallel NetCDF Datasets

Authors:
Yu Su; Gagan Agrawal
Venue:
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
Year:
2012

Abstract

While dissemination of scientific data is becoming crucial for facilitating scientific discoveries, a key challenge facing these efforts is that dataset sizes continue to grow rapidly. Because wide-area data transfer bandwidths and disk retrieval speeds are growing at a much slower pace, it is becoming extremely hard for scientists to download, manage, and process scientific datasets. We have developed a lightweight data management tool that allows server-side subsetting and aggregation on scientific datasets stored in their native format. While our approach is more general, this paper describes an implementation specific to NetCDF, one of the most popular scientific data formats. To support a variety of queries efficiently, our tool generates code for pre-filtering and post-filtering, and parallelizes selection and aggregation queries using novel algorithms. We have extensively evaluated our implementation and compared its performance and functionality against OPeNDAP. We demonstrate that even for subsetting queries that are directly supported in OPeNDAP, the sequential performance of our system is better. In addition, our system is capable of supporting a larger variety of queries, scaling performance by parallelizing the queries, and reducing wide-area data transfers through server-side data aggregation.
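
The query pattern the abstract describes can be illustrated with a minimal sketch. This is not the paper's generated code or its parallelization algorithm: it assumes a nested Python list as a stand-in for a two-dimensional NetCDF variable, and the names `make_grid`, `partial_aggregate`, and `parallel_mean` are hypothetical. It treats pre-filtering as restricting the scan to a requested index range, post-filtering as a value-based condition applied to the cells that were read, and parallelization as splitting the range into slabs whose partial results are combined on the server.

```python
from concurrent.futures import ThreadPoolExecutor


def make_grid(n_time, n_lat):
    """Hypothetical stand-in for a 2-D variable temp(time, lat)."""
    return [[float(t * n_lat + y) for y in range(n_lat)] for t in range(n_time)]


def partial_aggregate(grid, t_range, lat_range, predicate):
    """Scan one time slab and return (sum, count) of the matching values."""
    total, count = 0.0, 0
    for t in range(*t_range):
        # Pre-filter (assumed meaning): only the requested index range is read.
        row = grid[t][lat_range[0]:lat_range[1]]
        for value in row:
            # Post-filter (assumed meaning): value-based condition on read cells.
            if predicate(value):
                total += value
                count += 1
    return total, count


def parallel_mean(grid, t_range, lat_range, predicate, workers=2):
    """Split the time range into slabs, aggregate each, then combine."""
    start, stop = t_range
    step = max(1, -(-(stop - start) // workers))  # ceiling division
    slabs = [(s, min(s + step, stop)) for s in range(start, stop, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(
            lambda slab: partial_aggregate(grid, slab, lat_range, predicate),
            slabs,
        ))
    total = sum(p[0] for p in parts)
    count = sum(p[1] for p in parts)
    # Only the aggregate would leave the server, not the selected cells.
    return (total / count if count else None), count


if __name__ == "__main__":
    grid = make_grid(4, 5)
    mean, count = parallel_mean(grid, (1, 4), (1, 4), lambda v: v >= 8.0)
    print(mean, count)
```

Returning a `(sum, count)` pair from each slab, rather than a per-slab mean, keeps the combined result exact when slabs contain different numbers of matching cells. Threads are used here only for illustration; a real deployment over Parallel NetCDF would distribute the slabs differently.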