Adaptive Loop Tiling for a Multi-cluster CMP

Authors:
Jisheng Zhao;Matthew Horsnell;Mikel Luján;Ian Rogers;Chris Kirkham;Ian Watson
Affiliations:
University of Manchester, UK (all authors)
Venue:
ICA3PP '08 Proceedings of the 8th international conference on Algorithms and Architectures for Parallel Processing
Year:
2008

Abstract

Loop tiling is a fundamental optimization for improving data locality. Selecting the right tile size, combined with the parallelization of loops, can provide additional performance gains on modern Chip MultiProcessor (CMP) architectures. This paper presents a runtime optimization system that automatically parallelizes loops and searches empirically for the best tile sizes on a scalable multi-cluster CMP. The system is built on top of a virtual machine and targets the runtime parallelization and optimization of Java programs. Experimental results show that runtime parallelization and tile size searching can improve performance for two BLAS kernels and one Lattice-Boltzmann simulation, despite the runtime overheads.
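
To make the two ideas in the abstract concrete, the following is a minimal sketch of loop tiling applied to a matrix multiply, together with a naive empirical search that times a few candidate tile sizes and keeps the fastest. It is not the system described in the paper: the class name `TileSearchSketch`, the method names, the candidate tile sizes, and the single-run timing strategy are all assumptions made for illustration, and the sketch is sequential, so it omits the parallelization and virtual machine integration entirely.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the paper's runtime system.
public class TileSearchSketch {

    // Tiled C += A * B for square n x n matrices.
    // 'tile' is the edge length of each block; the Math.min bounds
    // handle the case where n is not a multiple of tile.
    static void tiledMultiply(double[][] a, double[][] b, double[][] c,
                              int n, int tile) {
        for (int ii = 0; ii < n; ii += tile) {
            int iMax = Math.min(ii + tile, n);
            for (int kk = 0; kk < n; kk += tile) {
                int kMax = Math.min(kk + tile, n);
                for (int jj = 0; jj < n; jj += tile) {
                    int jMax = Math.min(jj + tile, n);
                    // Work on one block so its data can stay in cache.
                    for (int i = ii; i < iMax; i++) {
                        for (int k = kk; k < kMax; k++) {
                            double aik = a[i][k];
                            for (int j = jj; j < jMax; j++) {
                                c[i][j] += aik * b[k][j];
                            }
                        }
                    }
                }
            }
        }
    }

    // Assumed search strategy: time one run per candidate tile size
    // and return the candidate with the shortest elapsed time.
    static int searchTileSize(double[][] a, double[][] b, int n,
                              int[] candidates) {
        int best = candidates[0];
        long bestTime = Long.MAX_VALUE;
        for (int tile : candidates) {
            double[][] c = new double[n][n];
            long start = System.nanoTime();
            tiledMultiply(a, b, c, n, tile);
            long elapsed = System.nanoTime() - start;
            if (elapsed < bestTime) {
                bestTime = elapsed;
                best = tile;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int n = 256;
        double[][] a = new double[n][n];
        double[][] b = new double[n][n];
        for (double[] row : a) Arrays.fill(row, 1.0);
        for (double[] row : b) Arrays.fill(row, 2.0);
        int best = searchTileSize(a, b, n, new int[] {8, 16, 32, 64, 128});
        System.out.println("Fastest tile size in this run: " + best);
    }
}
```

The winning tile size depends on the machine, the matrix size, and JIT warm-up, so a single timed run per candidate, as used here, is noisy; a more careful search would repeat each measurement and discard warm-up runs.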