ISCA '88 Proceedings of the 15th Annual International Symposium on Computer architecture
On the validity of trace-driven simulation for multiprocessors
ISCA '91 Proceedings of the 18th annual international symposium on Computer architecture
The Wisconsin Wind Tunnel: virtual prototyping of parallel computers
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
The accuracy of trace-driven simulations of multiprocessors
SIGMETRICS '93 Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems
ATOM: a system for building customized program analysis tools
PLDI '94 Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Full-system timing-first simulation
SIGMETRICS '02 Proceedings of the 2002 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Pin: building customized program analysis tools with dynamic instrumentation
Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation
Exploring the cache design space for large scale CMPs
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset
ACM SIGARCH Computer Architecture News - Special issue: dasCMP'05
Using Pin as a memory reference generator for multiprocessor simulation
ACM SIGARCH Computer Architecture News - Special issue on the 2005 workshop on binary instrumentation and application
The M5 Simulator: Modeling Networked Systems
IEEE Micro
A practical FPGA-based framework for novel CMP research
Proceedings of the 2007 ACM/SIGDA 15th international symposium on Field programmable gate arrays
Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays
COTSon: infrastructure for full system simulation
ACM SIGOPS Operating Systems Review
International Journal of Reconfigurable Computing - Special issue on selected papers from the 17th reconfigurable architectures workshop (RAW2010)
Filtering directory lookups in CMPs with write-through caches
Euro-Par'11 Proceedings of the 17th international conference on Parallel processing - Volume Part I
Filtering directory lookups in CMPs
Microprocessors & Microsystems
Fast architecture evaluation of heterogeneous MPSoCs by host-compiled simulation
Proceedings of the 15th International Workshop on Software and Compilers for Embedded Systems
ScalableCore system: a scalable many-core simulator by employing over 100 FPGAs
ARC'12 Proceedings of the 8th international conference on Reconfigurable Computing: architectures, tools and applications
CRAW/P: a workload partition method for the efficient parallel simulation of manycores
Euro-Par'12 Proceedings of the 18th international conference on Parallel Processing
Prototyping hardware support for irregular applications
Proceedings of the 2013 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools
Hi-index | 0.00 |
This paper proposes a novel methodology to efficiently simulate shared-memory multiprocessors composed of hundreds of cores. The basic idea is to use thread-level parallelism in the software system and translate it into corelevel parallelism in the simulated world. To achieve this, we first augment an existing full-system simulator to identify and separate the instruction streams belonging to the different software threads. Then, the simulator dynamically maps each instruction flow to the corresponding core of the target multi-core architecture, taking into account the inherent thread synchronization of the running applications. Our simulator allows a user to execute any multithreaded application in a conventional full-system simulator and evaluate the performance of the application on a many-core hardware. We carried out extensive simulations on the SPLASH-2 benchmark suite and demonstrated the scalability up to 1024 cores with limited simulation speed degradation vs. the single-core case on a fixed workload. The results also show that the proposed technique captures the intrinsic behavior of the SPLASH-2 suite, even when we scale up the number of shared-memory cores beyond the thousand-core limit.