Generalized just-in-time trace compilation using a parallel task farm in a dynamic binary translator

Authors:
Igor Böhm;Tobias J.K. Edler von Koch;Stephen C. Kyle;Björn Franke;Nigel Topham
Affiliations:
University of Edinburgh, Edinburgh, United Kingdom;University of Edinburgh, Edinburgh, United Kingdom;University of Edinburgh, Edinburgh, United Kingdom;University of Edinburgh, Edinburgh, United Kingdom;University of Edinburgh, Edinburgh, United Kingdom
Venue:
Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation
Year:
2011

Abstract

Dynamic Binary Translation (DBT) is the key technology behind cross-platform virtualization and allows software compiled for one Instruction Set Architecture (ISA) to be executed on a processor supporting a different ISA. Under the hood, DBT is typically implemented using Just-In-Time (JIT) compilation of frequently executed program regions, also called traces. The main challenge is translating these regions into highly efficient native code as fast as possible. Because JIT compilation time adds to the overall execution time, the JIT compiler is often decoupled from the main simulation loop and runs in a separate thread to reduce this overhead. In this paper, we present two contributions. The first is a generalized trace compilation approach that considers all frequently executed paths in a program for JIT compilation, as opposed to previous approaches where trace compilation is restricted to paths through loops. The second reduces JIT compilation cost by compiling several hot traces in a concurrent task farm. Altogether, we combine generalized light-weight tracing, large translation units, parallel JIT compilation, and dynamic work scheduling to ensure timely and efficient processing of hot traces. We have evaluated our industry-strength, LLVM-based parallel DBT implementing the ARCompact ISA against three benchmark suites (EEMBC, BioPerf and SPEC CPU2006) and demonstrate speedups of up to 2.08x on a standard quad-core Intel Xeon machine. Across short- and long-running benchmarks, our scheme is robust and never results in a slowdown. Using four processors, total execution time is reduced by 11.5% on average compared with state-of-the-art decoupled, parallel (or asynchronous) JIT compilation.
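
To make the task-farm idea concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes hot traces are queued by execution count so that the hottest trace is compiled first, and that a pool of worker threads drains the queue into a shared translation cache. The names `compile_trace`, `run_task_farm`, the trace IDs, and the block labels are all hypothetical, and the "compiler" is a stand-in that only returns a placeholder callable.

```python
import queue
import threading


def compile_trace(trace_id, blocks):
    """Hypothetical stand-in for JIT compilation of one trace.

    A real DBT would emit native code here; this sketch just returns a
    callable that reports which trace it represents.
    """
    return lambda: f"native:{trace_id}:{len(blocks)}"


def run_task_farm(hot_traces, num_workers=3):
    """Compile all hot traces with a farm of worker threads.

    hot_traces maps trace_id -> (execution_count, list_of_basic_blocks).
    Returns a translation cache mapping trace_id -> compiled callable.
    """
    work = queue.PriorityQueue()
    for trace_id, (count, blocks) in hot_traces.items():
        # Negate the count so the most frequently executed trace is
        # dequeued first (PriorityQueue returns the smallest item).
        work.put((-count, trace_id, blocks))

    cache = {}
    cache_lock = threading.Lock()

    def worker():
        # Each worker pulls the next-hottest trace until no work remains,
        # so faster workers naturally take on more traces.
        while True:
            try:
                _, trace_id, blocks = work.get_nowait()
            except queue.Empty:
                return
            native = compile_trace(trace_id, blocks)
            with cache_lock:
                cache[trace_id] = native

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cache


if __name__ == "__main__":
    traces = {0x100: (50, ["b0", "b1"]), 0x200: (900, ["b2"])}
    cache = run_task_farm(traces)
    print(sorted(hex(k) for k in cache))
```

In this sketch the caller waits for all workers to finish. In a decoupled design such as the one the abstract describes, the main simulation loop would instead keep interpreting and switch to native code for a trace once its translation appears in the cache.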