Toward fine-grained online task characteristics estimation in scientific workflows

  • Authors:
  • Rafael Ferreira da Silva (University of Lyon, Villeurbanne, France and University of Southern California, Marina Del Rey, CA)
  • Gideon Juve (University of Southern California, Marina Del Rey, CA)
  • Ewa Deelman (University of Southern California, Marina Del Rey, CA)
  • Tristan Glatard (University of Lyon, Villeurbanne, France and McGill University, Canada)
  • Frédéric Desprez (University of Lyon, Lyon, France)
  • Douglas Thain (University of Notre Dame, Notre Dame, IN)
  • Benjamin Tovar (University of Notre Dame, Notre Dame, IN)
  • Miron Livny (University of Wisconsin Madison, Madison, WI)

  • Venue:
  • WORKS '13 Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science
  • Year:
  • 2013

Abstract

Estimates of task characteristics such as runtime, disk space, and memory consumption are commonly used by scheduling algorithms and resource provisioning techniques to provide successful and efficient workflow executions. These methods assume that accurate estimates are available, but in production systems it is hard to compute such estimates with good accuracy. In this work, we first profile three real scientific workflows, collecting fine-grained information such as process I/O, runtime, memory usage, and CPU utilization. We then propose a method to automatically characterize workflow task needs based on these profiles. Our method estimates task runtime, disk space, and memory consumption based on the size of a task's input data. It looks for correlations between the parameters of a dataset and, if no correlation is found, divides the dataset into smaller subsets using a clustering technique. Task behavior is then estimated from the ratio of the parameter to the input data size if the two are correlated, or from the mean value otherwise. However, task dependencies in scientific workflows lead to a chain of estimation errors. To correct such errors, we propose an online estimation process based on the MAPE-K loop, in which task executions are constantly monitored and estimates are updated accordingly. Experimental results show that our online estimation process yields much more accurate predictions than an offline approach, where all task needs are estimated at once.
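
The estimation rule described in the abstract (use the ratio of parameter to input size when the two are correlated, otherwise the mean) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `TaskEstimator`, the correlation threshold of 0.8, the use of Pearson correlation, and the pooled ratio are all assumptions made for illustration, and the clustering step is omitted. The `observe` method stands in for the monitoring part of an online loop, where each finished task refines later estimates.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


class TaskEstimator:
    """Estimates one parameter (e.g. runtime) of a task type from its input size.

    Hypothetical helper for illustration only.
    """

    def __init__(self, threshold=0.8):  # assumed cutoff, not taken from the paper
        self.threshold = threshold
        self.sizes = []
        self.values = []

    def observe(self, input_size, value):
        """Record a completed execution so that later estimates use it."""
        self.sizes.append(input_size)
        self.values.append(value)

    def is_correlated(self):
        """True if the parameter grows with input size strongly enough."""
        if len(self.values) < 2:
            return False
        return pearson(self.sizes, self.values) >= self.threshold

    def estimate(self, input_size):
        """Scale by the pooled ratio if correlated, else fall back to the mean."""
        if not self.values:
            raise ValueError("no observations yet")
        if self.is_correlated():
            ratio = sum(self.values) / sum(self.sizes)
            return ratio * input_size
        return mean(self.values)


# Usage: runtime grows with input size, so the ratio rule applies.
runtime = TaskEstimator()
for size, seconds in [(100, 10.0), (200, 20.0), (400, 40.0)]:
    runtime.observe(size, seconds)
print(runtime.estimate(300))

# Usage: memory is flat regardless of input size, so the mean is used.
memory = TaskEstimator()
for size, megabytes in [(100, 5.0), (200, 5.0), (300, 5.0)]:
    memory.observe(size, megabytes)
print(memory.estimate(1000))
```

Calling `observe` after every finished task and `estimate` before every submission gives a simple online variant; an offline variant would call `estimate` for all tasks up front, using only historical profiles.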