FastQuery: A Parallel Indexing System for Scientific Data

Authors:
Jerry Chou; Kesheng Wu; Prabhat
Venue:
CLUSTER '11 Proceedings of the 2011 IEEE International Conference on Cluster Computing
Year:
2011

Citing 0
Cited 6

SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)

Parallel I/O, analysis, and visualization of a trillion particle simulation
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Scalable in situ scientific data encoding for analytical query processing
Proceedings of the 22nd international symposium on High-performance parallel and distributed computing

Optimizing FastQuery performance on Lustre file system
Proceedings of the 25th International Conference on Scientific and Statistical Database Management

Examining extended and scientific metadata for scalable index designs
Proceedings of the 6th International Systems and Storage Conference

SDS: a framework for scientific data services
PDSW '13 Proceedings of the 8th Parallel Data Storage Workshop

Abstract

Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies such as FastBit can significantly improve access to these datasets by augmenting the user data with indexes and other secondary information. However, these indexes assume the relational data model, whereas scientific data generally follows the array data model. To bridge the two data models, we design a generic mapping mechanism and implement an efficient input and output interface for reading and writing the data and their corresponding indexes. To take advantage of emerging many-core architectures, we also develop a thread-based parallel indexing strategy, which complements our ongoing MPI-based parallelization efforts. We demonstrate the flexibility of our software by applying it to two of the most commonly used scientific data formats, HDF5 and NetCDF. We present two case studies using data from a particle accelerator model and a global climate model, and we conduct a detailed performance study using these scientific datasets. The results show that FastQuery speeds up queries by a factor of 2.5 to 50 and reduces indexing time by a factor of 16 on 24 cores.
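
The two ideas in the abstract, mapping array data onto a relational column and building the index in parallel with threads, can be illustrated with a small sketch. This is not FastQuery's or FastBit's API: the function names (`flatten`, `row_to_coords`, `index_chunk`, `build_index`, `query_bins`), the row-major mapping, the fixed bin edges, and the uncompressed bitmaps are all assumptions chosen for illustration, and the input is a plain nested list rather than an HDF5 or NetCDF variable.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def flatten(array, shape):
    """Map an N-d nested list to one column; the row id is the row-major offset."""
    column = []
    for coords in product(*(range(n) for n in shape)):
        item = array
        for c in coords:
            item = item[c]
        column.append(item)
    return column


def row_to_coords(row, shape):
    """Invert the row-major mapping: row id -> array coordinates."""
    coords = []
    for n in reversed(shape):
        row, c = divmod(row, n)
        coords.append(c)
    return tuple(reversed(coords))


def index_chunk(column, start, stop, edges):
    """Build one bitmap per bin (a Python int used as a bit set) for rows [start, stop).

    Values outside [edges[0], edges[-1]) are simply left unindexed.
    """
    bitmaps = [0] * (len(edges) - 1)
    for row in range(start, stop):
        for b in range(len(edges) - 1):
            if edges[b] <= column[row] < edges[b + 1]:
                bitmaps[b] |= 1 << row
                break
    return bitmaps


def build_index(column, edges, workers=4):
    """Index disjoint row ranges in separate threads, then OR the partial bitmaps."""
    step = max(1, -(-len(column) // workers))  # ceiling division
    ranges = [(s, min(s + step, len(column))) for s in range(0, len(column), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda r: index_chunk(column, r[0], r[1], edges), ranges))
    merged = [0] * (len(edges) - 1)
    for part in parts:
        for b, bits in enumerate(part):
            merged[b] |= bits
    return merged


def query_bins(index, shape, bins):
    """Return the array coordinates of every cell whose value falls in the given bins."""
    hits = 0
    for b in bins:
        hits |= index[b]
    return [row_to_coords(r, shape) for r in range(hits.bit_length()) if (hits >> r) & 1]


# Hypothetical 2 x 3 "temperature" variable with two bins: [0, 5) and [5, 10).
temperature = [[1.0, 7.5, 3.2], [9.9, 0.4, 6.1]]
shape = (2, 3)
index = build_index(flatten(temperature, shape), edges=[0, 5, 10], workers=2)
print(query_bins(index, shape, bins=[1]))  # [(0, 1), (1, 0), (1, 2)]
```

The sketch shows only the structure of the approach: each thread indexes a disjoint range of rows, so the partial bitmaps can be merged with a bitwise OR without coordination. In CPython, pure-Python threads like these would not be expected to yield the kind of speedup reported in the abstract, which comes from a native multithreaded implementation.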