Compiler directed parallelization of loops in Scale for shared-memory multiprocessors

  • Authors:
  • Gregory S. Johnson; Simha Sethumadhavan

  • Affiliations:
  • Department of Computer Sciences & Texas Advanced Computing Center, The University of Texas at Austin, Austin, TX; Department of Computer Sciences, The University of Texas at Austin, Austin, TX

  • Venue:
  • ICCS'03 Proceedings of the 2003 International Conference on Computational Science: Part III
  • Year:
  • 2003

Abstract

Effective utilization of symmetric shared-memory multiprocessors (SMPs) is predicated on the development of efficient parallel code. Unfortunately, efficient parallelism is not always easy for the programmer to identify. Worse, exploiting such parallelism may directly conflict with optimizations affecting per-processor utilization (e.g., loop reordering to improve data locality). Here, we present our experience with a loop-level parallel compiler optimization for SMPs proposed by McKinley [6]. The algorithm uses dependence analysis and a simple model of the target machine to transform nested loops. The goal of the approach is to promote efficient execution of parallel loops by exposing sources of large-grain parallel work while maintaining per-processor locality. We implement the optimization within the Scale compiler framework and analyze the performance of multiprocessor code produced for three microbenchmarks.
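
To make the trade-off described in the abstract concrete, the following sketch shows the general kind of loop-nest transformation involved: placing a parallel loop at the outermost level for coarse-grain work while keeping the stride-1 loop innermost for per-processor locality. This is hand-written C with OpenMP, not output of the Scale compiler and not McKinley's exact algorithm; the array names, sizes, and loop nest are hypothetical.

```c
/* Illustrative sketch only (hypothetical arrays and sizes), not the paper's
 * algorithm or Scale compiler output. */
#include <stddef.h>

#define N 2048
static double a[N][N], b[N][N];

/* Original nest: both loops carry no dependences and are therefore parallel,
 * but the innermost i loop walks down a column, giving stride-N accesses in
 * row-major C; parallelizing the inner loop would also yield only fine-grain
 * work per processor. */
void original(void)
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] = a[i][j] + b[i][j];
}

/* After interchange: the parallel i loop is outermost, so each processor
 * receives a large block of rows (coarse-grain parallel work), and the
 * innermost j loop accesses memory with stride 1 (good per-processor
 * cache locality). */
void transformed(void)
{
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = a[i][j] + b[i][j];
}
```

In this sketch, dependence analysis establishes that the interchange is legal (no loop-carried dependences), and the ordering is then chosen so that parallelism and locality no longer conflict, which is the kind of outcome the optimization aims for.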