Design and analysis of data management in scalable parallel scripting

Authors:
Zhao Zhang;Daniel S. Katz;Justin M. Wozniak;Allan Espinosa;Ian Foster
Affiliations:
University of Chicago;University of Chicago & Argonne National Laboratory;Argonne National Laboratory;University of Chicago;University of Chicago & Argonne National Laboratory
Venue:
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Year:
2012


Abstract

We seek to enable efficient large-scale parallel execution of applications in which a shared filesystem abstraction is used to couple many tasks. Such parallel scripting (many-task computing, MTC) applications suffer poor performance and utilization on large parallel computers because of the volume of filesystem I/O and a lack of appropriate optimizations in the shared filesystem. Thus, we design and implement a scalable MTC data management system that uses aggregated compute node local storage for more efficient data movement strategies. We co-design the data management system with the data-aware scheduler to enable dataflow pattern identification and automatic optimization. The framework reduces the time to solution of parallel stages of an astronomy data analysis application, Montage, by 83.2% on 512 cores; decreases the time to solution of a seismology application, CyberShake, by 7.9% on 2,048 cores; and delivers BLAST performance better than mpiBLAST at various scales up to 32,768 cores, while preserving the flexibility of the original BLAST application.
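To make the idea of dataflow pattern identification concrete, the following is a minimal sketch and not the system described in the abstract. It assumes each task declares the files it reads and writes, and it labels every file by how many tasks produce and consume it. The function name `classify_files`, the pattern labels, and the Montage-like task and file names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def classify_files(tasks):
    """Label each file by a simple dataflow pattern.

    tasks maps a task name to a pair (input_files, output_files).
    The labels are illustrative assumptions, not the paper's taxonomy:
      external-input: no task in the script writes the file
      final-output:   written by a task but read by none
      pipeline:       written by a task and read by exactly one task
      multicast:      written by a task and read by two or more tasks
    """
    readers = defaultdict(set)
    writers = defaultdict(set)
    for name, (inputs, outputs) in tasks.items():
        for f in inputs:
            readers[f].add(name)
        for f in outputs:
            writers[f].add(name)

    patterns = {}
    for f in sorted(set(readers) | set(writers)):
        if not writers[f]:
            patterns[f] = "external-input"
        elif not readers[f]:
            patterns[f] = "final-output"
        elif len(readers[f]) == 1:
            patterns[f] = "pipeline"
        else:
            patterns[f] = "multicast"
    return patterns


# Hypothetical Montage-like script: two projections feed a difference
# step and a co-addition step; the difference feeds a single fit step.
example_tasks = {
    "project_1": (["raw_1.fits"], ["proj_1.fits"]),
    "project_2": (["raw_2.fits"], ["proj_2.fits"]),
    "diff_1_2": (["proj_1.fits", "proj_2.fits"], ["diff_1_2.fits"]),
    "fit_1_2": (["diff_1_2.fits"], ["fit_1_2.txt"]),
    "add": (["proj_1.fits", "proj_2.fits"], ["mosaic.fits"]),
}

if __name__ == "__main__":
    for path, label in classify_files(example_tasks).items():
        print(f"{path}: {label}")
```

A data-aware scheduler could, under these assumptions, use such labels to choose a movement strategy per file, for example keeping a pipeline file on the node that produced it and replicating a multicast file to every node that will read it.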