This paper addresses workload partition strategies for the simulation of manycore architectures. The key observation behind this paper is that, compared to traditional multicores, manycores exhibit more non-uniform memory access and less predictable network traffic; these features degrade the simulation speed and accuracy of Parallel Discrete Event Simulators (PDES) when static workload partition schemes are used. Based on this observation, we propose an adaptive workload partition method: Core/Router-Adaptive Workload Partition (CRAW/P). The method delivers higher speedup and accuracy than static partition schemes by partitioning the simulation of the on-chip network independently from that of the cores and by synchronizing the two differently. Using a PDES simulator, we evaluate the performance of CRAW/P in simulating a 256-core general-purpose manycore processor. Experimental results with SPLASH-2 benchmark applications demonstrate that it improves simulation speed by 28% to 67% over a static partition scheme and reduces timing errors to