Hybrid merge/overlap execution technique for parallel array processing

Authors:
Emad Soroush;Magdalena Balazinska
Affiliations:
University of Washington, Seattle;University of Washington, Seattle
Venue:
Proceedings of the EDBT/ICDT 2011 Workshop on Array Databases
Year:
2011

Abstract

Whether in business or science, multi-dimensional arrays are a common abstraction in data analytics, and many systems exist for efficiently processing arrays. As datasets grow in size, it is becoming increasingly important to process these arrays in parallel. In this paper, we discuss different types of array operations and review how they can be processed in parallel using two existing techniques. The first technique, which we call merge, consists of partitioning an array, processing the partitions in parallel, then merging the results to reconcile computations that span partition boundaries. The second technique, which we call overlap, consists of partitioning an array into subarrays that overlap by a given number of cells along each dimension. Thanks to this overlap, the array partitions can be processed in parallel without any merge phase. We discuss when each technique can be applied to an array operation. We show that even for a single array operation, a different approach may yield the best performance for different regions of an array. Following this observation, we introduce a new parallel array processing technique that combines the merge and overlap approaches. Our technique enables a parallel array processing system to mix and match the merge and overlap techniques within a single operation on an array. Through experiments on real scientific data, we show that this hybrid approach outperforms the other two techniques.
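
The two existing techniques described in the abstract can be illustrated with a minimal sketch on a one-dimensional array. This is not the authors' implementation: the function names, the choice of operations (a windowed sum for overlap, detection of runs of non-zero cells for merge), and the fixed chunk size are all assumptions made for illustration. The partitions are processed in a plain loop that stands in for parallel workers, and the hybrid technique itself is not shown because the abstract does not describe its details.

```python
from typing import List, Tuple


def window_sum(a: List[int], r: int) -> List[int]:
    """Reference result: for each cell, sum all cells within distance r."""
    n = len(a)
    return [sum(a[max(0, i - r):min(n, i + r + 1)]) for i in range(n)]


def overlap_window_sum(a: List[int], r: int, chunk: int) -> List[int]:
    """Overlap technique (illustrative): each partition carries r extra
    cells on each side, so it can be processed with no merge phase."""
    n = len(a)
    out: List[int] = []
    for start in range(0, n, chunk):          # each iteration = one worker
        end = min(n, start + chunk)
        lo, hi = max(0, start - r), min(n, end + r)
        sub = a[lo:hi]                        # partition plus overlap cells
        local = window_sum(sub, r)
        out.extend(local[start - lo:end - lo])  # keep only the core cells
    return out


def local_runs(a: List[int], start: int, end: int) -> List[Tuple[int, int]]:
    """Find runs of non-zero cells inside [start, end) as (begin, end) pairs,
    where the end index is exclusive."""
    runs: List[Tuple[int, int]] = []
    i = start
    while i < end:
        if a[i] != 0:
            j = i
            while j < end and a[j] != 0:
                j += 1
            runs.append((i, j))
            i = j
        else:
            i += 1
    return runs


def merge_runs(a: List[int], chunk: int) -> List[Tuple[int, int]]:
    """Merge technique (illustrative): partitions do not overlap; a final
    pass stitches together runs that touch at a partition boundary."""
    merged: List[Tuple[int, int]] = []
    for start in range(0, len(a), chunk):     # each iteration = one worker
        end = min(len(a), start + chunk)
        for run in local_runs(a, start, end):
            # A run that begins exactly where the previous one ended can
            # only arise from a cut at a partition boundary, so join them.
            if merged and merged[-1][1] == run[0]:
                merged[-1] = (merged[-1][0], run[1])
            else:
                merged.append(run)
    return merged


data = [1, 1, 0, 2, 2, 2, 0, 0, 3, 1]
overlap_result = overlap_window_sum(data, r=1, chunk=4)
merge_result = merge_runs(data, chunk=4)
```

In this sketch the overlap variant works because the windowed sum only needs a bounded neighbourhood of `r` cells, whereas a run of non-zero cells can be arbitrarily long, so no fixed overlap is guaranteed to contain it and a merge step is used instead.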