Scaling up workflow-based applications

  • Authors:
  • Scott Callaghan;Ewa Deelman;Dan Gunter;Gideon Juve;Philip Maechling;Christopher Brooks;Karan Vahi;Kevin Milner;Robert Graves;Edward Field;David Okaya;Thomas Jordan

  • Affiliations:
  • University of Southern California, Los Angeles, CA 90089, United States;USC Information Sciences Institute, Marina Del Rey, CA 90292, United States;Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States;University of Southern California, Los Angeles, CA 90089, United States;University of Southern California, Los Angeles, CA 90089, United States;University of San Francisco, CA 94117, United States;USC Information Sciences Institute, Marina Del Rey, CA 90292, United States;University of Southern California, Los Angeles, CA 90089, United States;URS Corporation, Pasadena, CA 91101, United States;US Geological Survey, Pasadena, CA 91106, United States;University of Southern California, Los Angeles, CA 90089, United States;University of Southern California, Los Angeles, CA 90089, United States

  • Venue:
  • Journal of Computer and System Sciences
  • Year:
  • 2010

Abstract

Scientific applications, often expressed as workflows, are making use of large-scale national cyberinfrastructure to explore the behavior of systems, search for phenomena in large-scale data, and conduct many other scientific endeavors. As the complexity of the systems being studied grows and as data set sizes increase, the scale of the computational workflows increases as well. In some cases, workflows now have hundreds of thousands of individual tasks. Managing such scale is difficult from the point of view of workflow description, execution, and analysis. In this paper, we describe the challenges faced by workflow management and performance analysis systems when dealing with an earthquake science application, CyberShake, executing on the TeraGrid. The scientific goal of the SCEC CyberShake project is to calculate probabilistic seismic hazard curves for sites in Southern California. For each site of interest, the CyberShake platform includes two large-scale MPI calculations and approximately 840,000 embarrassingly parallel post-processing jobs. We show how we approach the scalability challenges in our workflow management and log mining systems.