The cache performance and optimizations of blocked algorithms
ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
Compiler blockability of numerical algorithms
Proceedings of the 1992 ACM/IEEE conference on Supercomputing
Tile size selection using cache organization and data layout
PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
The Jalapeño dynamic optimizing compiler for Java
JAVA '99 Proceedings of the ACM 1999 conference on Java Grande
LAPACK Users' guide (third ed.)
LAPACK Users' guide (third ed.)
Adaptive optimization in the Jalapeño JVM
OOPSLA '00 Proceedings of the 15th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications
High-level adaptive program optimization with ADAPT
PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
High Performance Compilers for Parallel Computing
High Performance Compilers for Parallel Computing
IEEE Micro
Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
Minimizing development and maintenance costs in supporting persistently optimized BLAS
Software—Practice & Experience - Research Articles
Optimizing chip multiprocessor work distribution using dynamic compilation
Euro-Par'07 Proceedings of the 13th international Euro-Par conference on Parallel Processing
Proceedings of the 2nd ACM/SPEC International Conference on Performance engineering
Journal of Computational and Applied Mathematics
On-chip cache hierarchy-aware tile scheduling for multicore machines
CGO '11 Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization
Hi-index | 0.00 |
Loop tiling is a fundamental optimization for improving data locality. Selecting the right tile size combined with the parallelization of loops can provide additional performance increases in the modern of Chip MultiProcessor (CMP) architectures. This paper presents a runtime optimization system which automatically parallelizes loops and searches empirically for the best tile sizes on a scalable multi-cluster CMP. The system is built on top of a virtual machine and targets the runtime parallelization and optimization of Java programs. Experimental results show that runtime parallelization and tile size searching are capable of improving performance for two BLAS kernels and one Lattice-Boltzmann simulation, despite overheads.