Scientific workflows provide a portable representation of scientific applications: they coordinate input, output, and execution management for highly parallel runs of interdependent computations, and they support sharing and validating the results. As scientific workflows scale to hundreds of thousands of distinct tasks, failures due to software and hardware faults become increasingly common. Real-time execution monitoring provides a foundation for improving the transparency and resilience of workflows in the face of stochastic and systematic faults. Building on previous work on early detection of these failure scenarios, we describe methods for guiding the remediation of stochastic errors by predicting their impact on application performance. To complement this analysis, we also describe techniques for isolating systematic sources of failure. We evaluate our methods on a representative sample of large real-world workflows.