Clustered Workflow Execution of Retargeted Data Analysis Scripts

Authors:
Daniel L. Wang;Charles S. Zender;Stephen F. Jenks
Affiliations:
-;-;-
Venue:
CCGRID '08 Proceedings of the 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid
Year:
2008

Citing 0
Cited 4

Searching workflows with hierarchical views
Proceedings of the VLDB Endowment

SciHadoop: array-based query processing in Hadoop
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis

Supporting User-Defined Subsetting and Aggregation over Parallel NetCDF Datasets
CCGRID '12 Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)

SDQuery DSI: integrating data management support with a wide area data transfer protocol
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Abstract

Supercomputing advances have enabled computational science data volumes to grow at ever-increasing rates, commonly resulting in more data being produced than can be practically analyzed. Whole-dataset download costs have become impractical, even with multi-Gbps networks, forcing scientists to rely on server-side subsetting and limiting the scope of data they can analyze on a workstation. Our system supplements existing scientific data services with lightweight computational capability, providing a means of safely relocating analysis from the desktop to the server, where clustered execution can be coordinated. This exploits data locality, reduces unnecessary data transfer, and provides end users with results several times faster. We show how dataflow and other compiler-inspired analyses of shell scripts that invoke scientists' most common analysis tools enable parallelization and optimizations in disk and network I/O bandwidth. We benchmark using an actual geoscience analysis script, illustrating the crucial performance gains of extracting workflows defined in scripts and optimizing their execution. Current results quantify significant improvements in performance, showing the promise of bringing transparent high-performance analysis to the scientist's desktop.
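
The idea of extracting a workflow from a shell script by dataflow analysis can be illustrated with a minimal sketch. This is not the authors' implementation. The command names (avg, diff), the file names, and the argument convention (flags start with '-' and take no value, the last file argument is the output, every file is written at most once) are all assumptions made for illustration. The sketch records which command produces each file, derives read-after-write dependencies, and groups commands into stages whose members could run in parallel on a cluster.

```python
import shlex


def parse_command(line):
    """Split one script line into (input_files, output_file).

    Assumed convention (hypothetical, not from the paper): tokens starting
    with '-' are value-less flags, the last remaining token is the output
    file, and all other remaining tokens are input files.
    """
    tokens = shlex.split(line)
    files = [t for t in tokens[1:] if not t.startswith("-")]
    return files[:-1], files[-1]


def build_dependencies(lines):
    """Map each command index to the earlier commands whose output it reads.

    Only read-after-write dependencies are tracked, which is sufficient
    under the assumption that each file is written at most once.
    """
    last_writer = {}  # file name -> index of the command that wrote it
    deps = {}
    for i, line in enumerate(lines):
        inputs, output = parse_command(line)
        deps[i] = {last_writer[f] for f in inputs if f in last_writer}
        last_writer[output] = i
    return deps


def schedule_stages(deps):
    """Group commands into stages; commands in the same stage are independent."""
    level = {}
    # Dependencies always point to earlier lines, so index order is a valid
    # topological order.
    for i in sorted(deps):
        level[i] = 1 + max((level[d] for d in deps[i]), default=-1)
    stages = [[] for _ in range(max(level.values(), default=-1) + 1)]
    for i, lv in level.items():
        stages[lv].append(i)
    return stages


# Hypothetical three-line analysis script.
script = [
    "avg -O jan_a.nc jan_b.nc jan_mean.nc",
    "avg -O feb_a.nc feb_b.nc feb_mean.nc",
    "diff jan_mean.nc feb_mean.nc change.nc",
]

print(schedule_stages(build_dependencies(script)))  # prints [[0, 1], [2]]
```

In this example the two averaging commands share no files, so they land in the same stage and could be dispatched to different nodes, while the differencing command must wait for both. A real system would also need to handle shell variables, loops, flags that take values, and files that are overwritten, none of which this sketch attempts.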