Active workflow system for near real-time extreme-scale science

Authors:
Yanwei Zhang; Qing Liu; Scott Klasky; Matthew Wolf; Karsten Schwan; Greg Eisenhauer; Jong Choi; Norbert Podhorszki
Affiliations:
College of Computing, Georgia Institute of Technology, Atlanta, GA, USA; Scientific Data Group, Oak Ridge National Laboratory, Oak Ridge, TN, USA; Scientific Data Group, Oak Ridge National Laboratory, Oak Ridge, TN, USA; College of Computing, Georgia Institute of Technology, Atlanta, GA, USA; College of Computing, Georgia Institute of Technology, Atlanta, GA, USA; College of Computing, Georgia Institute of Technology, Atlanta, GA, USA; Scientific Data Group, Oak Ridge National Laboratory, Oak Ridge, TN, USA; Scientific Data Group, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Venue:
Proceedings of the first workshop on Parallel programming for analytics applications
Year:
2014

Abstract

In recent years, streaming-based data processing has gained substantial traction for handling the overwhelming volumes of data generated by real-time applications, from both enterprise sources and scientific computing. In this work, however, we look at an emerging class of scientific data with Near Real-Time (NRT) requirements, in which data is typically generated in bursts and the near real-time constraints apply primarily between bursts rather than within a stream. A key challenge for this type of data source is that the processing time per data element is not uniform and cannot always be predicted. Given the increasing unpredictability of compute load and system dynamics, this work adapts the streaming-based approach to this new class of large experiments and simulations, which have complex run-time control and analysis issues. In particular, we deploy a novel two-tier scheme for handling the increasing unpredictability of runtime behavior: instead of relying solely on deciding what scientific workflows to run and where to run them beforehand or in a partially dynamic fashion, the decision is also adaptively refined online according to the runtime status of the system. This is enabled by embedding the workflow in the data streams. Specifically, we break the data output generated by experiments or simulations into multiple self-describing "chunks", which we call active data objects. As a result, if a transient hotspot is observed, a data object with an unfinished workflow pipeline can break from its previous schedule and search for the least loaded location to continue execution. Our preliminary experimental results, based on synthetic workloads, show that the proposed active workflow system is a promising solution: it outperforms state-of-the-art semi-dynamic workflow schedulers in workflow completion time and scales well.
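
The idea of an active data object can be illustrated with a minimal sketch. The abstract does not describe the system's actual data layout, scheduler, or load metric, so everything below is an assumption made for illustration: the `ActiveDataObject` class, the `step` and `least_loaded` helpers, the integer load table, and the `hotspot_threshold` parameter are all hypothetical names. The sketch only shows a chunk that carries its remaining pipeline with it and, when its current location looks overloaded, moves to the least loaded location before running its next stage.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Stage = Callable[[List[float]], List[float]]


@dataclass
class ActiveDataObject:
    """Hypothetical self-describing chunk: data plus its own workflow."""
    payload: List[float]
    pipeline: List[Stage]          # workflow stages embedded with the data
    location: str                  # node currently holding the chunk
    next_stage: int = 0            # index of the first unfinished stage
    trace: List[str] = field(default_factory=list)  # where each stage ran

    def done(self) -> bool:
        return self.next_stage >= len(self.pipeline)


def least_loaded(load: Dict[str, int]) -> str:
    """Return the node with the smallest load (ties broken by name)."""
    return min(sorted(load), key=load.__getitem__)


def step(obj: ActiveDataObject, load: Dict[str, int],
         hotspot_threshold: int) -> None:
    """Run one stage, relocating first if the current node is a hotspot."""
    if obj.done():
        return
    if load[obj.location] > hotspot_threshold:
        # Break the previous schedule and move to the least loaded node.
        obj.location = least_loaded(load)
    stage = obj.pipeline[obj.next_stage]
    obj.payload = stage(obj.payload)
    obj.trace.append(obj.location)
    obj.next_stage += 1


def normalize(xs: List[float]) -> List[float]:
    peak = max(xs)
    return [x / peak for x in xs]


def keep_large(xs: List[float]) -> List[float]:
    return [x for x in xs if x >= 0.5]


# Assumed load table; a real system would observe this at run time.
load = {"node-a": 2, "node-b": 1, "node-c": 5}
obj = ActiveDataObject(payload=[1.0, 2.0, 4.0],
                       pipeline=[normalize, keep_large],
                       location="node-a")

step(obj, load, hotspot_threshold=3)   # node-a is below the threshold
load["node-a"] = 9                     # a transient hotspot appears
step(obj, load, hotspot_threshold=3)   # the object relocates, then runs
```

In this toy run the first stage executes on `node-a` and the second on `node-b`, because the object re-evaluates its placement before each stage rather than following a schedule fixed in advance. A threshold on a single integer load value is a simplification chosen for the sketch; the abstract does not say how hotspots are detected.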