Dynamic pointer alignment: tiling and communication optimizations for parallel pointer-based computations

Authors:
Xingbin Zhang;Andrew A. Chien
Affiliations:
Department of Computer Science, University of Illinois at Urbana-Champaign;Department of Computer Science, University of Illinois at Urbana-Champaign and Hewlett Packard Laboratories
Venue:
PPOPP '97 Proceedings of the sixth ACM SIGPLAN symposium on Principles and practice of parallel programming
Year:
1997

Abstract

Loop tiling and communication optimizations, such as message pipelining and aggregation, can deliver robust memory performance by proactively managing storage and data movement. In this paper, we generalize these techniques to pointer-based data structures (PBDSs). Our approach, dynamic pointer alignment (DPA), has two components. At compile time, the compiler decomposes a program into non-blocking threads that operate on specific pointers and labels each thread creation site with its corresponding pointers. At runtime, an explicit mapping from pointers to dependent threads is updated at thread creation and is used to dynamically schedule both threads and communication, so that threads using the same objects execute together, communication overlaps with local work, and messages are aggregated. We have implemented DPA to optimize remote reads to global PBDSs on parallel machines. Our empirical results on the force computation phases of two applications that use sophisticated PBDSs, Barnes-Hut and FMM, show that DPA achieves good absolute performance and speedups by enabling tiling and communication optimization on the CRAY T3D.
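The runtime component described in the abstract can be illustrated with a small single-process model. This is only a sketch under stated assumptions, not the authors' implementation: the class `PointerAlignedScheduler`, the callbacks `owner_of` and `fetch_batch`, and the two-node layout in `demo` are all hypothetical names invented for illustration. It shows a pointer-to-threads map that is filled at thread creation, then used to run all threads that depend on the same object back to back (the tiling effect) and to send one aggregated request per owning node. Overlap of communication with local work is only modeled by ordering: local threads run before remote data is consumed.

```python
from collections import defaultdict


class PointerAlignedScheduler:
    """Toy model of a pointer -> dependent-threads map (hypothetical API)."""

    def __init__(self, local_node, owner_of, fetch_batch):
        self.local_node = local_node
        self.owner_of = owner_of          # assumed: pointer -> owning node id
        self.fetch_batch = fetch_batch    # assumed: (node, pointers) -> {pointer: object}
        self.pending = defaultdict(list)  # pointer -> threads waiting on that object
        self.messages_sent = 0

    def spawn(self, pointer, thread):
        """Register a non-blocking thread under the pointer it operates on."""
        self.pending[pointer].append(thread)

    def run(self, local_store):
        results = []
        while self.pending:
            # Take the current map; threads spawned while running go to the next round.
            batch, self.pending = self.pending, defaultdict(list)

            # Aggregation: group remote pointers by the node that owns them.
            by_owner = defaultdict(list)
            for ptr in batch:
                node = self.owner_of(ptr)
                if node != self.local_node:
                    by_owner[node].append(ptr)

            # Local work first. A real runtime would issue the remote requests
            # asynchronously before this loop; this model only mimics the order.
            for ptr, threads in batch.items():
                if self.owner_of(ptr) == self.local_node:
                    obj = local_store[ptr]
                    results.extend(thread(obj) for thread in threads)

            # One message per owner; threads sharing an object run together.
            for node, ptrs in by_owner.items():
                self.messages_sent += 1
                fetched = self.fetch_batch(node, ptrs)
                for ptr in ptrs:
                    results.extend(thread(fetched[ptr]) for thread in batch[ptr])
        return results


def demo():
    store = {p: p * 100 for p in range(6)}   # pointer -> object payload
    owner_of = lambda p: p % 2               # assumed layout: odd pointers live on node 1
    requests = []

    def fetch_batch(node, ptrs):
        requests.append((node, sorted(ptrs)))
        return {p: store[p] for p in ptrs}

    sched = PointerAlignedScheduler(0, owner_of, fetch_batch)
    for p in [1, 3, 1, 2, 3, 1]:             # creation order interleaves the pointers
        sched.spawn(p, lambda obj: obj + 1)

    local_store = {p: v for p, v in store.items() if owner_of(p) == 0}
    results = sched.run(local_store)
    return results, requests, sched.messages_sent
```

In `demo`, five of the six threads read remote objects, yet the model sends a single request covering the two distinct remote pointers, and the three threads that use pointer 1 run consecutively even though they were created at different times.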