Scientific workflows are an enabler of complex scientific analyses. They provide both a portable representation and a foundation upon which results can be validated and shared. Large-scale scientific workflows are executed on equally complex parallel and distributed resources, where many things can fail. Application scientists need to track the status of their workflows in real time, detect execution anomalies automatically, and troubleshoot failures, all without logging into remote nodes or searching through thousands of log files. As part of the NSF Stampede project, we have developed an infrastructure to answer these needs. The infrastructure captures application-level logs and resource information, normalizes them to standard representations, and stores them centrally under a general-purpose schema. Higher-level tools mine the logs in real time to determine current status, predict failures, and detect anomalous performance.
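
The capture, normalize, store, and mine stages described above can be illustrated with a small sketch. It is not the Stampede implementation: the record formats, the `events` table, the function names, and the "slower than three times the median" rule are all hypothetical stand-ins chosen to keep the example short.

```python
import sqlite3
import statistics

# Hypothetical raw records from two sources that report jobs differently.
RAW_RECORDS = [
    {"src": "engine", "job": "j1", "state": "DONE", "secs": 10.0},
    {"src": "engine", "job": "j2", "state": "DONE", "secs": 12.0},
    {"src": "engine", "job": "j3", "state": "DONE", "secs": 11.0},
    {"src": "scheduler", "id": "j4", "exit": 0, "ms": 95000},
    {"src": "scheduler", "id": "j5", "exit": 1, "ms": 9000},
]


def normalize(record):
    """Map a source-specific record onto one (job, status, runtime_s) tuple."""
    if record["src"] == "engine":
        return (record["job"], record["state"].lower(), float(record["secs"]))
    if record["src"] == "scheduler":
        status = "done" if record["exit"] == 0 else "failed"
        return (record["id"], status, record["ms"] / 1000.0)
    raise ValueError(f"unknown source: {record['src']}")


def load(records):
    """Store normalized events in one central table (in-memory for the sketch)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (job TEXT, status TEXT, runtime_s REAL)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [normalize(r) for r in records],
    )
    return conn


def status_counts(conn):
    """Current workflow status: number of jobs per normalized status."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM events GROUP BY status"
    ).fetchall()
    return dict(rows)


def slow_jobs(conn, factor=3.0):
    """Flag jobs whose runtime exceeds `factor` times the median runtime.

    This simple rule is an assumption for illustration, not the anomaly
    detection method used by the project.
    """
    rows = conn.execute(
        "SELECT job, runtime_s FROM events ORDER BY job"
    ).fetchall()
    median = statistics.median([runtime for _, runtime in rows])
    return [job for job, runtime in rows if runtime > factor * median]


conn = load(RAW_RECORDS)
print(status_counts(conn))
print(slow_jobs(conn))
```

Because every source is mapped onto the same tuple before storage, the status and anomaly queries never need to know which tool produced a record. That is the benefit of normalizing to a common schema before any mining takes place.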