Autonomous, failure-resilient orchestration of distributed discrete event simulations

Authors:
Matthew Malensek;Zhiquan Sui;Neil Harvey;Shrideep Pallickara
Affiliations:
Colorado State University, Fort Collins, CO;Colorado State University, Fort Collins, CO;University of Guelph, Guelph, ON, Canada;Colorado State University, Fort Collins, CO
Venue:
Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
Year:
2013

Abstract

Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of relevant events and conditions naturally provides a more accurate model, but also increases the computational workload associated with the simulation. To manage these processing requirements in a scalable manner, a discrete event simulation can be distributed across a number of computing resources. However, individual tasks in the simulation are stateful, and therefore require inter-task communication and synchronization to produce an accurate model. This property not only complicates the orchestration of the discrete event simulation in a distributed setting, but also makes providing reliable, fault-tolerant execution a challenge, especially when compared to conventional distributed fault tolerance schemes. In this paper, we propose an autonomous agent that provides fault tolerance functionality for discrete event simulations by predicting state changes in the simulation and adjusting its fault tolerance policy accordingly. This allows the system to avoid negatively impacting overall execution times while preserving reliability guarantees. To underscore the viability of our solution, we provide benchmarks of a production discrete event simulation that can sustain failures while running under the supervision of our fault tolerance framework.