Automatic data layout for distributed-memory machines

Authors:
Ken Kennedy;Ulrich Kremer
Affiliations:
Rice Univ., Houston, TX;Rutgers Univ., Piscataway, NJ
Venue:
ACM Transactions on Programming Languages and Systems (TOPLAS)
Year:
1998

Citing 51
Cited 35

Compilers: principles, techniques, and tools

Compilers: principles, techniques, and tools
The SCHEME programming language

The SCHEME programming language
Integer and combinatorial optimization

Integer and combinatorial optimization
Solving problems on concurrent processors. Vol. 1: General techniques and regular problems

Solving problems on concurrent processors. Vol. 1: General techniques and regular problems
Process decomposition through locality of reference

PLDI '89 Proceedings of the ACM SIGPLAN 1989 Conference on Programming language design and implementation
A methodology for parallelizing programs for multicomputers and complex memory multiprocessors

Proceedings of the 1989 ACM/IEEE conference on Supercomputing
Data optimization: allocation of arrays to reduce communication on SIMD machines

Journal of Parallel and Distributed Computing - Massively parallel computation
Introduction to algorithms

Introduction to algorithms
A static performance estimator to guide data partitioning decisions

PPOPP '91 Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programming
The data alignment phase in compiling programs for distributed-memory machines

Journal of Parallel and Distributed Computing
Automatic data mapping for distributed-memory parallel computers

Automatic data mapping for distributed-memory parallel computers
The Omega test: a fast and practical integer programming algorithm for dependence analysis

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Vienna Fortran—a Fortran language extension for distributed memory multiprocessors

Languages, compilers and run-time environments for distributed memory machines
Automatic data mapping for distributed-memory parallel computers

ICS '92 Proceedings of the 6th international conference on Supercomputing
Compiling crystal for distributed-memory machines

Compiling crystal for distributed-memory machines
Global optimizations for parallelism and locality on scalable parallel machines

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
Automatic array alignment in data-parallel programs

POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
A novel framework of register allocation for software pipelining

POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The high performance Fortran handbook

The high performance Fortran handbook
Automatic data partitioning on distributed memory multicomputers

Automatic data partitioning on distributed memory multicomputers
Automatic data allocation to minimize communication on SIMD machines

The Journal of Supercomputing
Minimizing register requirements under resource-constrained rate-optimal software pipelining

MICRO 27 Proceedings of the 27th annual international symposium on Microarchitecture
An optimizing Fortran D compiler for MIMD distributed-memory machines

An optimizing Fortran D compiler for MIMD distributed-memory machines
Optimal evaluation of array expressions on massively parallel machines

ACM Transactions on Programming Languages and Systems (TOPLAS)
Scheduling and mapping: software pipelining in the presence of structural hazards

PLDI '95 Proceedings of the ACM SIGPLAN 1995 conference on Programming language design and implementation
Automatic alignment of array data and processes to reduce communication time on DMPPs

PPOPP '95 Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming
Automatic data layout for high performance Fortran

Supercomputing '95 Proceedings of the 1995 ACM/IEEE conference on Supercomputing
Minimizing communication while preserving parallelism

ICS '96 Proceedings of the 10th international conference on Supercomputing
Automatic data layout for distributed memory machines

Automatic data layout for distributed memory machines
Efficient distribution analysis via graph contraction

International Journal of Parallel Programming - Special issue: selected papers from the eighth international workshop on languages and compilers for parallel computing
Compiler techniques for data partitioning of sequentially iterated parallel loops

ICS '90 Proceedings of the 4th international conference on Supercomputing
Dynamic data distribution with control flow analysis

Supercomputing '96 Proceedings of the 1996 ACM/IEEE conference on Supercomputing
Flow Analysis of Computer Programs

Flow Analysis of Computer Programs
Requirements for Data-Parallel Programming Environments

IEEE Parallel & Distributed Technology: Systems & Technology
Compiling Communication-Efficient Programs for Massively Parallel Machines

IEEE Transactions on Parallel and Distributed Systems
Compiling Global Name-Space Parallel Loops for Distributed Execution

IEEE Transactions on Parallel and Distributed Systems
Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers

IEEE Transactions on Parallel and Distributed Systems
Automatic Partitioning of Data and Computations on Scalable Shared Memory Multiprocessors

ICPP '97 Proceedings of the international Conference on Parallel Processing
Interprocedural Array Redistribution Data-Flow Analysis

LCPC '96 Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing
Automatic Data Layout with Read-Only Replication and Memory Constraints

LCPC '97 Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing
Automatic Data Decomposition for Message-Passing Machines

LCPC '97 Proceedings of the 10th International Workshop on Languages and Compilers for Parallel Computing
Communication-Free Hyperplane Partitioning of Nested Loops

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Array Distribution in Data-Parallel Programs

LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
Fine-Grain Scheduling under Resource Constraints

LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
Solving Alignment Using Elementary Linear Algebra

LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
Detecting and Using Affinity in an Automatic Data Distribution Tool

LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
Automatic Data Layout Using 0-1 Integer Programming

PACT '94 Proceedings of the IFIP WG10.3 Working Conference on Parallel Architectures and Compilation Techniques
An Overview of the Fortran D Programming System

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Automatic data and computation decomposition for distributed memory machines

HICSS '95 Proceedings of the 28th Hawaii International Conference on System Sciences
Compiler techniques for optimizing communication and data distribution for distributed-memory multicomputers

Compiler techniques for optimizing communication and data distribution for distributed-memory multicomputers
Automatic computation and data decomposition for multiprocessors

Automatic computation and data decomposition for multiprocessors

Nonlinear array layouts for hierarchical memory systems

ICS '99 Proceedings of the 13th international conference on Supercomputing
Recursive array layouts and fast parallel matrix multiplication

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
Accurate data redistribution cost estimation in software distributed shared memory systems

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
Automatic data and computation decomposition on distributed memory parallel computers

ACM Transactions on Programming Languages and Systems (TOPLAS)
Integrating loop and data transformations for global optimization

Journal of Parallel and Distributed Computing
Recursive Array Layouts and Fast Matrix Multiplication

IEEE Transactions on Parallel and Distributed Systems
Segmented Alignment: An Enhanced Model to Align Data Parallel Programs of HPF

The Journal of Supercomputing
Fortran RED - A Retargetable Environment for Automatic Data Layout

LCPC '98 Proceedings of the 11th International Workshop on Languages and Compilers for Parallel Computing
Compiler-Directed Dynamic Frequency and Voltage Scheduling

PACS '00 Proceedings of the First International Workshop on Power-Aware Computer Systems-Revised Papers
Energy management of virtual memory on diskless devices

Compilers and operating systems for low power
Improving effective bandwidth through compiler enhancement of global cache reuse

Journal of Parallel and Distributed Computing
Array regrouping and structure splitting using whole-program reference affinity

Proceedings of the ACM SIGPLAN 2004 conference on Programming language design and implementation
Support and optimization for parallel sparse programs with array intrinsics of Fortran 90

Parallel Computing
Quasidynamic Layout Optimizations for Improving Data Locality

IEEE Transactions on Parallel and Distributed Systems
Dyn-MPI: Supporting MPI on Non Dedicated Clusters

Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Using multiple energy gears in MPI programs on a power-scalable cluster

Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming
Automatic array partitioning based on the Smith normal form

International Journal of Parallel Programming
Lightweight reference affinity analysis

Proceedings of the 19th annual international conference on Supercomputing
The MHETA Execution Model for Heterogeneous Clusters

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Just In Time Dynamic Voltage Scaling: Exploiting Inter-Node Slack to Save Energy in MPI Programs

SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
A hierarchical model of data locality

Conference record of the 33rd ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Minimizing execution time in MPI programs on an energy-constrained, power-scalable cluster

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming
The hardness of cache conscious data placement

Nordic Journal of Computing
Dyn-MPI: Supporting MPI on medium-scale, non-dedicated clusters

Journal of Parallel and Distributed Computing
The rise and fall of High Performance Fortran: an historical object lesson

Proceedings of the third ACM SIGPLAN conference on History of programming languages
Analyzing the Energy-Time Trade-Off in High-Performance Computing Applications

IEEE Transactions on Parallel and Distributed Systems
Just-in-time dynamic voltage scaling: Exploiting inter-node slack to save energy in MPI programs

Journal of Parallel and Distributed Computing
Global Tiling for Communication Minimal Parallelization on Distributed Memory Systems

Euro-Par '08 Proceedings of the 14th international Euro-Par conference on Parallel Processing
Applying Data Mapping Techniques to Vector DSPs

Journal of Signal Processing Systems
Dynamic thread and data mapping for NoC based CMPs

Proceedings of the 46th Annual Design Automation Conference
Compiler-assisted data distribution for chip multiprocessors

Proceedings of the 19th international conference on Parallel architectures and compilation techniques
Data layout transformation for stencil computations on short-vector SIMD architectures

CC'11/ETAPS'11 Proceedings of the 20th international conference on Compiler construction: part of the joint European conferences on theory and practice of software
Parallelism and data movement characterization of contemporary application classes

Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
Optimization of dense matrix multiplication on IBM cyclops-64: challenges and experiences

Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Compiling affine loop nests for distributed-memory parallel architectures

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Quantified Score

Hi-index	0.00

Visualization

Abstract

The goal of languages like Fortran D or High Performance Fortran (HPF) is to provide a simple yet efficient machine-independent parallel programming model. After the algorithm selection, the data layout choice is the key intellectual challenge in writing an efficient program in such languages. The performance of a data layout depends on the target compilation system, the target machine, the problem size, and the number of available processors. This makes the choice of a good layout extremely difficult for most users of such languages. If languages such as HPF are to find general acceptance, the need for data layout selection support has to be addressed. We beleive that the appropriate way to provide the needed support is through a tool that generates data layout specifications automatically. This article discusses the design and implementation of a data layout selection tool that generates HPF-style data layout specifications automatically. Because layout is done in a tool that is not embedded in the target compiler and hence will be run only a few times during the tuning phase of an application, it can use techniques such as integer programming that may be considered too computationally expensive for inclusion in production compilers. The proposed framework for automatic data layout selection builds and examines search spaces of candidate data layouts. A candidate layout is an efficient layout for some part of the program. After the generation of search spaces, a single candidate layout is selected for each program part, resulting in a data layout for the entire program. A good overall data layout may require the remapping of arrays between program parts. A performance estimator based on a compiler model, an execution model, and a machine model are needed to predict the execution time of each candidate layout and the costs of possible remappings between candidate data layouts. In the proposed framework, instances of NP-complete problems are solved during the construction of candidate layout search spaces and the final selection of candidate layouts from each search space. Rather than resorting to heuristics, the framework capitalizes on state-of-the-art 0-1 integer programming technology to compute optimal solutions of these NP-complete problems. A prototype data layout assistant tool based on our framework has been implemented as part of the D system currently under development at Rice University. The article reports preliminary experimental results. The results indicate that the framework is efficient and allows the generation of data layouts of high quality.