While dissemination of scientific data is becoming crucial for facilitating scientific discoveries, a key challenge facing these efforts is that dataset sizes continue to grow rapidly. Because wide-area data transfer bandwidths and disk retrieval speeds are growing at a much slower pace, it is becoming extremely hard for scientists to download, manage, and process scientific datasets. We have developed a lightweight data management tool that allows server-side subsetting and aggregation on scientific datasets stored in a native format. While our approach is more general, this paper describes an implementation specific to NetCDF, one of the most popular scientific data formats. To support a variety of queries efficiently, our tool generates code for pre-filtering and post-filtering, and parallelizes selection and aggregation queries using novel algorithms. We have extensively evaluated our implementation and compared its performance and functionality against OPeNDAP. We demonstrate that even for subsetting queries that are directly supported in OPeNDAP, the sequential performance of our system is better. In addition, our system supports a larger variety of queries, scales performance by parallelizing queries, and reduces wide-area data transfers through server-side data aggregation.
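The idea of pre-filtering, post-filtering, and parallel aggregation can be illustrated with a minimal sketch. It is not the tool's generated code or its algorithms: it assumes pre-filtering means selecting by dimension index ranges and post-filtering means selecting by data value, it uses an in-memory nested list in place of a NetCDF variable, and every function name here is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a 2-D NetCDF variable indexed as grid[time][y].
grid = [[t * 10 + y for y in range(4)] for t in range(6)]

def prefilter(grid, t_range, y_range):
    """Dimension-based selection: slice by index ranges before inspecting values."""
    t0, t1 = t_range
    y0, y1 = y_range
    return [row[y0:y1] for row in grid[t0:t1]]

def postfilter(rows, predicate):
    """Value-based selection: keep only the elements that satisfy the predicate."""
    return [v for row in rows for v in row if predicate(v)]

def partial_aggregate(grid, t_range, y_range, predicate):
    """Aggregate one partition into a (sum, count) pair that can be merged later."""
    values = postfilter(prefilter(grid, t_range, y_range), predicate)
    return (sum(values), len(values))

def server_side_mean(grid, t_range, y_range, predicate, workers=2):
    """Split the time range into chunks, aggregate each chunk, merge the partials."""
    t0, t1 = t_range
    step = max(1, -(-(t1 - t0) // workers))  # ceiling division
    chunks = [(s, min(s + step, t1)) for s in range(t0, t1, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(
            pool.map(lambda c: partial_aggregate(grid, c, y_range, predicate), chunks)
        )
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count if count else None

# Only a single number would cross the network, not the selected sub-array.
mean = server_side_mean(grid, (1, 5), (1, 3), lambda v: v >= 21)
print(mean)
```

Returning (sum, count) pairs rather than per-chunk means keeps the merge step exact regardless of how many values each chunk contributes, which is one common way to make an average parallelizable.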