Can PDES scale in environments with heterogeneous delays?

Authors:
Jingjing Wang;Ketan Bahulkar;Dmitry Ponomarev;Nael Abu-Ghazaleh
Affiliations:
Binghamton University, Binghamton, NY, USA;Binghamton University, Binghamton, NY, USA;Binghamton University, Binghamton, NY, USA;Binghamton University, Binghamton, NY, USA
Venue:
Proceedings of the 2013 ACM SIGSIM conference on Principles of advanced discrete simulation
Year:
2013

Abstract

The performance and scalability of Parallel Discrete Event Simulation (PDES) are often limited by communication latencies and overheads. The emergence of multi-core processors and their expected evolution into many-cores offers the promise of low-latency communication and tight memory integration between cores; these properties should significantly improve the performance of PDES in such environments. However, on clusters of multi-cores (CMs), the latency and processing overheads incurred when communicating between different machines (nodes) far outweigh those between cores on the same chip, especially when commodity networking fabrics and communication software are used. It is unclear whether the low latency among cores on the same node yields any benefit, given that communication links across nodes are significantly worse. In this study, we examine the performance of a multi-threaded implementation of PDES on CMs. We demonstrate that inter-node communication costs impose a substantial bottleneck on PDES and that, without optimizations addressing these long latencies, multi-threaded PDES does not significantly outperform the multi-process version despite direct communication through shared memory on the individual nodes. We then propose three optimizations: message consolidation and routing, infrequent polling, and latency-sensitive model partitioning. We show that with these optimizations in place, the threaded implementation of PDES significantly outperforms the process-based implementation even on CMs.
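
The abstract names message consolidation but does not say how it is implemented. As a rough illustration only, and not the authors' code, the general idea of batching events bound for the same remote node can be sketched as below. The class name `ConsolidatingSender`, the `send_fn` transport callback, and the `batch_size` threshold are all hypothetical names introduced for this sketch.

```python
from collections import defaultdict


class ConsolidatingSender:
    """Illustrative sketch: buffer outgoing events per destination node
    and hand them to the transport as one batch instead of one by one."""

    def __init__(self, send_fn, batch_size=4):
        # send_fn(dest_node, events) is a hypothetical transport callback.
        self.send_fn = send_fn
        # batch_size is an assumed flush threshold, not a value from the paper.
        self.batch_size = batch_size
        self.buffers = defaultdict(list)

    def post(self, dest_node, event):
        """Queue an event; flush once the destination's buffer is full."""
        buf = self.buffers[dest_node]
        buf.append(event)
        if len(buf) >= self.batch_size:
            self.flush(dest_node)

    def flush(self, dest_node):
        """Send everything buffered for one destination as a single message."""
        events = self.buffers.pop(dest_node, [])
        if events:
            self.send_fn(dest_node, events)

    def flush_all(self):
        """Drain every buffer, e.g. before a synchronization point."""
        for dest_node in list(self.buffers):
            self.flush(dest_node)
```

With a threshold of `batch_size` events, the number of inter-node sends drops by roughly that factor, at the cost of delaying buffered events until the next flush. A real simulator would have to weigh that delay against rollback or blocking behavior, which this sketch does not model.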