Vectorization technology to improve interpreter performance

Authors:
Erven Rohou;Kevin Williams;David Yuste
Affiliations:
Inria Rennes -- Bretagne Atlantique;Inria Rennes -- Bretagne Atlantique;Inria Rennes -- Bretagne Atlantique
Venue:
ACM Transactions on Architecture and Code Optimization (TACO) - Special Issue on High-Performance Embedded Architectures and Compilers
Year:
2013

Citing 15
Cited 0

Threaded code

Communications of the ACM
A scalable cross-platform infrastructure for application performance tuning using hardware counters

Proceedings of the 2000 ACM/IEEE conference on Supercomputing
The Behavior of Efficient Virtual Machine Interpreters on Modern Architectures

Euro-Par '01 Proceedings of the 7th International Euro-Par Conference Manchester on Parallel Processing
Optimizing indirect branch prediction accuracy in virtual machine interpreters

PLDI '03 Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation
Practical Java Game Programming (Game Development Series)

Practical Java Game Programming (Game Development Series)
Combining stack caching with dynamic superinstructions

Proceedings of the 2004 workshop on Interpreters, virtual machines and emulators
Multi-platform Auto-vectorization

Proceedings of the International Symposium on Code Generation and Optimization
MiBench: A free, commercially representative embedded benchmark suite

WWC '01 Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop
The java hotspotTM server compiler

JVM'01 Proceedings of the 2001 Symposium on JavaTM Virtual Machine Research and Technology Symposium - Volume 1
Validity of the single processor approach to achieving large scale computing capabilities

AFIPS '67 (Spring) Proceedings of the April 18-20, 1967, spring joint computer conference
A highly flexible, parallel virtual machine: design and experience of ILDJIT

Software—Practice & Experience
Vectorization for Java

NPC'10 Proceedings of the 2010 IFIP international conference on Network and parallel computing
Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework

Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
Speculatively vectorized bytecode

Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers
Vapor SIMD: Auto-vectorize once, run everywhere

CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization

Quantified Score

Hi-index	0.00

Visualization

Abstract

In the present computing landscape, interpreters are in use in a wide range of systems. Recent trends in consumer electronics have created a new category of portable, lightweight software applications. Typically, these applications have fast development cycles and short life spans. They run on a wide range of systems and are deployed in a target independent bytecode format over Internet and cellular networks. Their authors are untrusted third-party vendors, and they are executed in secure managed runtimes or virtual machines. Furthermore, due to security policies or development time constraints, these virtual machines often lack just-in-time compilers and rely on interpreted execution. At the other end of the spectrum, interpreters are also a reality in the field of high-performance computations because of the flexibility they provide. The main performance penalty in interpreters arises from instruction dispatch. Each bytecode requires a minimum number of machine instructions to be executed. In this work, we introduce a novel approach for interpreter optimization that reduces instruction dispatch thanks to vectorization technology. We extend the split compilation paradigm to interpreters, thus guaranteeing that our approach exhibits almost no overhead at runtime. We take advantage of the vast research in vectorization and its presence in modern compilers. Complex analyses are performed ahead of time, and their results are conveyed to the executable bytecode. At runtime, the interpreter retrieves this additional information to build the SIMD IR (intermediate representation) instructions that carry the vector semantics. The bytecode language remains unmodified, making this representation compatible with legacy interpreters and previously proposed JIT compilers. We show that this approach drastically reduces the number of instructions to interpret and decreases execution time of vectorizable applications. Moreover, we map SIMD IR instructions to hardware SIMD instructions when available, with a substantial additional improvement. Finally, we finely analyze the impact of our extension on the behavior of the caches and branch predictors.