ISABELA-QA: query-driven analytics with ISABELA-compressed extreme-scale scientific data

  • Authors:
  • Sriram Lakshminarasimhan;John Jenkins;Isha Arkatkar;Zhenhuan Gong;Hemanth Kolla;Seung-Hoe Ku;Stephane Ethier;Jackie Chen;C. S. Chang;Scott Klasky;Robert Latham;Robert Ross;Nagiza F. Samatova

  • Affiliations:
  • North Carolina State University, NC and Oak Ridge National Laboratory, Oak Ridge, TN;North Carolina State University, NC and Oak Ridge National Laboratory, Oak Ridge, TN;North Carolina State University, NC and Oak Ridge National Laboratory, Oak Ridge, TN;North Carolina State University, NC;Sandia National Laboratory, Livermore, CA;New York University, New York, NY;Princeton Plasma Physics Laboratory, Princeton, NJ;Sandia National Laboratory, Livermore, CA;New York University, New York, NY;Oak Ridge National Laboratory, Oak Ridge, TN;Argonne National Laboratory, Argonne, IL;Argonne National Laboratory, Argonne, IL;North Carolina State University, NC and Oak Ridge National Laboratory, Oak Ridge, TN

  • Venue:
  • Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
  • Year:
  • 2011

Abstract

Efficient analytics of scientific data from extreme-scale simulations is quickly becoming a top priority. Growing simulation output sizes demand a paradigm shift in how analytics is conducted. In this paper, we argue that query-driven analytics over compressed data, rather than the original full-size data, is a promising strategy for meeting the challenges of storage- and I/O-bound applications. As a proof of principle, we propose a parallel query processing engine, called ISABELA-QA, that is designed and optimized for knowledge-priors-driven analytical processing of spatio-temporal, multivariate scientific data that is initially compressed, in situ, by our ISABELA technology. With ISABELA-QA, the total data storage requirement is less than 23%-30% of the original data, which is up to eight-fold less than what existing state-of-the-art data management technologies, which require storing both the original data and the index, could offer. Since ISABELA-QA operates on the metadata generated by our compression technology, its underlying indexing technology for efficient query processing is lightweight; it requires less than 3% of the original data, unlike existing database indexing approaches that require 30%-300% of the original data. Moreover, ISABELA-QA is specifically optimized to retrieve the actual values, rather than spatial regions, for the variables that satisfy user-specified range queries, a functionality that is critical for high-accuracy data analytics. To the best of our knowledge, this is the first technology that enables query-driven analytics over compressed spatio-temporal floating-point data in double or single precision, while offering a lightweight memory and disk storage footprint with parallel, scalable, multi-node, multi-core, GPU-based query processing.