Parallel index and query for large scale data analysis

Authors:
Jerry Chou; Mark Howison; Brian Austin; Kesheng Wu; Ji Qiang; E. Wes Bethel; Arie Shoshani; Oliver Rübel; Prabhat; Rob D. Ryne
Venue:
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
Year:
2011

Abstract

Modern scientific datasets present numerous data management and analysis challenges. State-of-the-art index and query technologies are critical for facilitating interactive exploration of large datasets, but numerous challenges remain in designing a system for processing general scientific datasets. Such a system needs to run on distributed multi-core platforms, efficiently utilize the underlying I/O infrastructure, and scale to massive datasets. We present FastQuery, a novel software framework that addresses these challenges. FastQuery utilizes a state-of-the-art index and query technology (FastBit) and is designed to process massive datasets on modern supercomputing platforms. We apply FastQuery to the processing of a massive 50TB dataset generated by a large-scale accelerator modeling code and demonstrate the scalability of the tool to 11,520 cores. Motivated by the scientific need to search for interesting particles in this dataset, we use our framework to reduce search time from hours to tens of seconds.