Efficiently parallelizing instruction set simulation of embedded multi-core processors using region-based just-in-time dynamic binary translation

Authors:
Stephen Kyle;Igor Böhm;Björn Franke;Hugh Leather;Nigel Topham
Affiliations:
University of Edinburgh, Edinburgh, United Kingdom (all authors)
Venue:
Proceedings of the 13th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory for Embedded Systems
Year:
2012

Abstract

Embedded systems, as typified by modern mobile phones, are already seeing a drive toward multi-core processors, and the number of cores will likely increase rapidly in the future. Engineers and researchers need to simulate systems as they are expected to be a few generations from now, running simulations of many-core devices on today's multi-core machines. These requirements place heavy demands on the scalability of simulation engines, the fastest of which have typically evolved from just-in-time (JIT) dynamic binary translation (DBT) systems. Existing work on parallelizing DBT simulators has focused exclusively on trace-based DBT, in which linear execution traces, or trees of such traces, are the units of translation. Region-based DBT simulators have not received the same attention and require different techniques than their trace-based counterparts. In this paper we develop an innovative approach to scaling multi-core embedded simulation through region-based DBT. We first modify the JIT code generator of such a simulator to emit thread-agnostic code, that is, code that does not depend on a particular thread and its thread-specific context. We then demonstrate that this thread-agnostic code performs comparably to thread-specific code while also enabling JIT-compiled regions to be shared between threads. This sharing optimisation, in turn, yields significant performance improvements for multi-threaded applications: our results show that, on average, 76% of all JIT-compiled regions can be shared among 128 threads in representative parallel workloads. This translates into an overall speedup of 1.44x on average, and up to 2.40x, across 12 multi-threaded benchmarks from the SPLASH-2 suite, running on our high-performance multi-core DBT simulator for embedded ARC processors on a 4-core Intel host machine.
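The core idea above, thread-agnostic translated code that any simulated core can execute, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' ARC simulator: the toy instruction set (`addi`, `mov`), the names `CpuState`, `SharedRegionCache` and `simulate`, and the guest address `0x1000` are all hypothetical. Each translated region is a host function that receives the CPU state as a parameter instead of embedding one thread's context, so a single translation, looked up by guest entry PC in a lock-protected cache, can be reused by every core.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class CpuState:
    """Per-core architectural state; each simulated core owns exactly one."""
    core_id: int
    regs: List[int] = field(default_factory=lambda: [0] * 4)


# Hypothetical toy guest region: a list of (opcode, dst, src_or_imm) tuples.
Region = List[Tuple[str, int, int]]
Translated = Callable[[CpuState], None]


def translate_thread_agnostic(region: Region) -> Translated:
    """'Compile' a region into a host function that takes the CPU state as an
    argument rather than hard-wiring one core's context, so any core may run it."""
    ops = list(region)  # snapshot of the region's instructions

    def compiled(cpu: CpuState) -> None:
        for op, dst, arg in ops:
            if op == "addi":      # regs[dst] += immediate
                cpu.regs[dst] += arg
            elif op == "mov":     # regs[dst] = regs[src]
                cpu.regs[dst] = cpu.regs[arg]
            else:
                raise ValueError(f"unknown opcode {op!r}")

    return compiled


class SharedRegionCache:
    """Code cache shared by all simulated cores, keyed by guest entry PC."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._cache: Dict[int, Translated] = {}
        self.translations = 0  # number of regions actually compiled

    def lookup_or_translate(self, pc: int, region: Region) -> Translated:
        # Holding the lock while translating keeps the sketch simple,
        # at the cost of serializing compilation.
        with self._lock:
            fn = self._cache.get(pc)
            if fn is None:
                fn = translate_thread_agnostic(region)
                self._cache[pc] = fn
                self.translations += 1
            return fn


def simulate(num_cores: int, iterations: int) -> Tuple[SharedRegionCache, List[CpuState]]:
    """Run the same guest region on num_cores host threads sharing one cache."""
    region: Region = [("addi", 0, 1), ("mov", 1, 0)]
    entry_pc = 0x1000  # hypothetical guest address of the region
    cache = SharedRegionCache()
    cpus = [CpuState(core_id=i) for i in range(num_cores)]

    def core_loop(cpu: CpuState) -> None:
        for _ in range(iterations):
            cache.lookup_or_translate(entry_pc, region)(cpu)

    threads = [threading.Thread(target=core_loop, args=(c,)) for c in cpus]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cache, cpus


if __name__ == "__main__":
    cache, cpus = simulate(num_cores=8, iterations=100)
    print(cache.translations)  # 1
    print(cpus[0].regs[:2])    # [100, 100]
```

With eight simulated cores the region is compiled once and reused by all of them. A thread-specific code generator would instead bake each core's context into its translation, so the same guest region would be compiled once per core; avoiding that duplication is what the sharing described above exploits.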