Mimic: a fast system/370 simulator
SIGPLAN '87 Papers of the Symposium on Interpreters and interpretive techniques
Shade: a fast instruction-set simulator for execution profiling
SIGMETRICS '94 Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems
The SPLASH-2 programs: characterization and methodological considerations
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Embra: fast and flexible machine simulation
Proceedings of the 1996 ACM SIGMETRICS international conference on Measurement and modeling of computer systems
Does “just in time” = “better late than never”?
Proceedings of the 24th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
DAISY: dynamic compilation for 100% architectural compatibility
Proceedings of the 24th annual international symposium on Computer architecture
A retargetable, ultra-fast instruction set simulator
DATE '99 Proceedings of the conference on Design, automation and test in Europe
Dynamo: a transparent dynamic optimization system
PLDI '00 Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation
Partial method compilation using dynamic profile information
OOPSLA '01 Proceedings of the 16th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
A universal technique for fast and flexible instruction-set architecture simulation
Proceedings of the 39th annual Design Automation Conference
Instruction set compiled simulation: a technique for fast and flexible instruction set simulation
Proceedings of the 40th annual Design Automation Conference
Automatic Synthesis of High-Speed Processor Simulators
Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture
A region-based compilation technique for dynamic compilers
ACM Transactions on Programming Languages and Systems (TOPLAS)
Reducing dynamic compilation overhead by overlapping compilation and execution
ASP-DAC '06 Proceedings of the 2006 Asia and South Pacific Design Automation Conference
Thread-Shared Software Code Caches
Proceedings of the International Symposium on Code Generation and Optimization
HotpathVM: an effective JIT compiler for resource-constrained devices
Proceedings of the 2nd international conference on Virtual execution environments
CODES+ISSS '06 Proceedings of the 4th international conference on Hardware/software codesign and system synthesis
QEMU, a fast and portable dynamic translator
ATEC '05 Proceedings of the annual conference on USENIX Annual Technical Conference
YETI: a graduallY extensible trace interpreter
Proceedings of the 3rd international conference on Virtual execution environments
Dynamic compilation: the benefits of early investing
Proceedings of the 3rd international conference on Virtual execution environments
The java hotspotTM server compiler
JVM'01 Proceedings of the 2001 Symposium on JavaTM Virtual Machine Research and Technology Symposium - Volume 1
A parallel dynamic compiler for CIL bytecode
ACM SIGPLAN Notices
Trace-based just-in-time type specialization for dynamic languages
Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation
Phase detection using trace compilation
PPPJ '09 Proceedings of the 7th International Conference on Principles and Practice of Programming in Java
Trace-based compilation in execution environments without interpreters
Proceedings of the 8th International Conference on the Principles and Practice of Programming in Java
Generalized just-in-time trace compilation using a parallel task farm in a dynamic binary translator
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
JIT Compilation Policy on Single-Core and Multi-core Machines
INTERACT '11 Proceedings of the 2011 15th Workshop on Interaction between Compilers and Computer Architectures
Trace-based compilation for the Java HotSpot virtual machine
Proceedings of the 9th International Conference on Principles and Practice of Programming in Java
Dynamically accelerating client-side web applications through decoupled execution
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
A trace-based Java JIT compiler retrofitted from a method-based compiler
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Hi-index | 0.00 |
Embedded systems, as typified by modern mobile phones, are already seeing a drive toward using multi-core processors. The number of cores will likely increase rapidly in the future. Engineers and researchers need to be able to simulate systems, as they are expected to be in a few generations time, running simulations of many-core devices on today's multi-core machines. These requirements place heavy demands on the scalability of simulation engines, the fastest of which have typically evolved from just-in-time (Jit) dynamic binary translators (Dbt). Existing work aimed at parallelizing Dbt simulators has focused exclusively on trace-based Dbt, wherein linear execution traces or perhaps trees thereof are the units of translation. Region-based Dbt simulators have not received the same attention and require different techniques than their trace-based cousins. In this paper we develop an innovative approach to scaling multi-core, embedded simulation through region-based Dbt. We initially modify the Jit code generator of such a simulator to emit code that does not depend on a particular thread with its thread-specific context and is, therefore, thread-agnostic. We then demonstrate that this thread-agnostic code generation is comparable to thread-specific code with respect to performance, but also enables the sharing of JIT-compiled regions between different threads. This sharing optimisation, in turn, leads to significant performance improvements for multi-threaded applications. In fact, our results confirm that an average of 76% of all JIT-compiled regions can be shared between 128 threads in representative, parallel workloads. We demonstrate that this translates into an overall performance improvement by 1.44x on average and up to 2.40x across 12 multi-threaded benchmarks taken from the Splash-2 benchmark suite, targeting our high-performance multi-core Dbt simulator for embedded Arc processors running on a 4-core Intel host machine.