Statistical modeling of large-scale simulation data

Authors:
Tina Eliassi-Rad;Terence Critchlow;Ghaleb Abdulla
Affiliations:
Center for Applied Scientific Computing, Livermore, CA;Center for Applied Scientific Computing, Livermore, CA;Center for Applied Scientific Computing, Livermore, CA
Venue:
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2002

Abstract

With the advent of fast computer systems, scientists are now able to generate terabytes of simulation data. Unfortunately, the sheer size of these data sets has made it impossible to explore them efficiently. To aid scientists in gleaning insight from their simulation data, we have developed an ad-hoc query infrastructure. Our system, called AQSim (short for Ad-hoc Queries for Simulation), reduces data storage requirements and query access times in two stages. First, it creates and stores mathematical and statistical models of the data at multiple resolutions. Second, it evaluates queries on the models of the data instead of on the entire data set. In this paper, we present two simple but effective statistical modeling techniques for simulation data. Our first modeling technique computes the "true" (unbiased) mean of systematic partitions of the data. It makes no assumptions about the distribution of the data and uses a variant of the root mean square error to evaluate a model. Our second statistical modeling technique applies the Anderson-Darling goodness-of-fit method to systematic partitions of the data. This method evaluates a model by how well the data it summarizes pass the test for normality. Both of our statistical models answer range queries effectively. At each resolution of the data, we compute the precision of our answer to the user's query by scaling the one-sided Chebyshev inequality with the original mesh's topology. We combine precisions at different resolutions by calculating their weighted average. Our experimental evaluations on two scientific simulation data sets illustrate the value of using these statistical modeling techniques on multiple resolutions of large simulation data sets.
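
The ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not AQSim's implementation: the abstract does not specify the RMSE variant, how the Chebyshev bound is scaled by the mesh topology, or which weights are used, so the plain RMSE, the unscaled one-sided Chebyshev (Cantelli) bound, and the caller-supplied weights below are all assumptions, and the function names are hypothetical.

```python
import math
from statistics import fmean


def partition_model(values):
    """Summarize one partition by its mean and the RMSE of that mean model.

    Assumption: plain RMSE around the mean; the paper uses an
    unspecified variant of it.
    """
    mu = fmean(values)
    rmse = math.sqrt(fmean([(v - mu) ** 2 for v in values]))
    return mu, rmse


def one_sided_chebyshev(sigma, k):
    """Cantelli bound: P(X - mu >= k) <= sigma^2 / (sigma^2 + k^2), k > 0.

    Assumption: unscaled bound; the mesh-topology scaling is omitted.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sigma ** 2 / (sigma ** 2 + k ** 2)


def combine_precisions(precisions, weights):
    """Weighted average of per-resolution precisions.

    Assumption: weights are supplied by the caller (for example, the
    number of mesh cells covered at each resolution).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(precisions, weights)) / total


if __name__ == "__main__":
    mu, rmse = partition_model([1.0, 2.0, 3.0, 4.0, 5.0])
    print("mean:", mu, "rmse:", rmse)
    print("tail bound:", one_sided_chebyshev(sigma=rmse, k=2.0))
    print("combined:", combine_precisions([0.9, 0.5], [3, 1]))
```

Because the bound depends only on the stored mean and spread of a partition, it can be evaluated on the model alone without revisiting the raw data, which is the property the abstract relies on.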