Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors

Authors:
Anant Agarwal;David A. Kranz;Venkat Natarajan
Affiliations:
-;-;-
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1995

Citing 19
Cited 35

Theory of linear and integer programming

Theory of linear and integer programming
Guided self-scheduling: A practical scheduling scheme for parallel supercomputers

IEEE Transactions on Computers
Strategies for cache and local memory management by global program transformation

Journal of Parallel and Distributed Computing - Special Issue on Languages, Compilers and environments for Parallel Programming
Supernode partitioning

POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The cache performance and optimizations of blocked algorithms

ASPLOS IV Proceedings of the fourth international conference on Architectural support for programming languages and operating systems
A data locality optimizing algorithm

PLDI '91 Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation
Tiling multidimensional iteration spaces for nonshared memory machines

Proceedings of the 1991 ACM/IEEE conference on Supercomputing
Global optimizations for parallelism and locality on scalable parallel machines

PLDI '93 Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation
The MIT Alewife machine: architecture and performance

ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Footprints in the cache

SIGMETRICS '86/PERFORMANCE '86 Proceedings of the 1986 ACM SIGMETRICS joint international conference on Computer performance modelling, measurement and evaluation
Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs

IEEE Transactions on Parallel and Distributed Systems
Compile-Time Partitioning of Iterative Parallel Loops to Reduce Cache Coherency Traffic

IEEE Transactions on Parallel and Distributed Systems
Compile-Time Techniques for Data Distribution in Distributed Memory Machines

IEEE Transactions on Parallel and Distributed Systems
Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers

IEEE Transactions on Parallel and Distributed Systems
Hierarchical Compilation of Macro Dataflow Graphs for Multiprocessors with Local Memory

IEEE Transactions on Parallel and Distributed Systems
M-Structures: Extending a Parallel, Non-strict, Functional Language with State

Proceedings of the 5th ACM Conference on Functional Programming Languages and Computer Architecture
On Estimating and Enhancing Cache Effectiveness

Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing
Memory Assignment for Multiprocessor Caches through Grey Coloring

PARLE '94 Proceedings of the 6th International PARLE Conference on Parallel Architectures and Languages Europe
Global Partitioning of Parallel loops and Data Arrays for Caches and Distributed Memory in Multiprocessors

Global Partitioning of Parallel loops and Data Arrays for Caches and Distributed Memory in Multiprocessors

An Efficient Solution to the Cache Thrashing Problem Caused by True Data Sharing

IEEE Transactions on Computers
A preprocessing step for global loop transformations for data transfer optimization

CASES '00 Proceedings of the 2000 international conference on Compilers, architecture, and synthesis for embedded systems
Optimal partitioning and balanced scheduling with the maximal overlap of data footprints

GLSVLSI '01 Proceedings of the 11th Great Lakes symposium on VLSI
Exploiting Wavefront Parallelism on Large-Scale Shared-Memory Multiprocessors

IEEE Transactions on Parallel and Distributed Systems
Data and memory optimization techniques for embedded systems

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Optimal semi-oblique tiling

Proceedings of the thirteenth annual ACM symposium on Parallel algorithms and architectures
Contention elimination by replication of sequential sections in distributed shared memory programs

PPoPP '01 Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming
A framework for performance-based program partitioning

Progress in computer research
Data Relation Vectors: A New Abstraction for Data Optimizations

IEEE Transactions on Computers - Special issue on the parallel architecture and compilation techniques conference
Automatic data and computation decomposition on distributed memory parallel computers

ACM Transactions on Programming Languages and Systems (TOPLAS)
Automatic Partitioning of Parallel Loops with Parallelepiped-Shaped Tiles

IEEE Transactions on Parallel and Distributed Systems
A framework for performance-based program partitioning

Progress in computer research
Energy/power estimation of regular processor arrays

Proceedings of the 15th international symposium on System Synthesis
Code Transformations for Data Transfer and Storage Exploration Preprocessing in Multimedia Processors

IEEE Design & Test
Cluster Computation for Flood Simulations

HPCN Europe 2001 Proceedings of the 9th International Conference on High-Performance Computing and Networking
Interaction Between Data Parallel Compilation and Data Transfer and Storage Cost Minimization for Multimedia Applications

Euro-Par '99 Proceedings of the 5th International Euro-Par Conference on Parallel Processing
Ground Water Flow Modelling in PVM

Proceedings of the 6th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
Interaction between parallel compilation and data transfer and storage cost minimization for multimedia applications

Practical parallel computing
Optimal task scheduling at run time to exploit intra-tile parallelism

Parallel Computing
High-level synthesis of distributed logic-memory architectures

Proceedings of the 2002 IEEE/ACM international conference on Computer-aided design
Synthesis of Heterogeneous Distributed Architectures for Memory-Intensive Applications

Proceedings of the 2003 IEEE/ACM international conference on Computer-aided design
Transformation synthesis for data intensive applications to FPGAs

GLSVLSI '06 Proceedings of the 16th ACM Great Lakes symposium on VLSI
Partitioning and scheduling DSP applications with maximal memory access hiding

EURASIP Journal on Applied Signal Processing
Generation of heterogeneous distributed architectures for memory-intensive applications through high-level synthesis

IEEE Transactions on Very Large Scale Integration (VLSI) Systems
Positivity, posynomials and tile size selection

Proceedings of the 2008 ACM/IEEE conference on Supercomputing
Parallelizing query optimization

Proceedings of the VLDB Endowment
Dependency-aware reordering for parallelizing query optimization in multi-core CPUs

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Optimizing shared cache behavior of chip multiprocessors

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
Automatic memory partitioning and scheduling for throughput and power optimization

Proceedings of the 2009 International Conference on Computer-Aided Design
Iterational retiming with partitioning: Loop scheduling with complete memory latency hiding

ACM Transactions on Embedded Computing Systems (TECS)
Automatic memory partitioning and scheduling for throughput and power optimization

ACM Transactions on Design Automation of Electronic Systems (TODAES)
Optimizing explicit data transfers for data parallel applications on the cell architecture

ACM Transactions on Architecture and Code Optimization (TACO) - HIPEAC Papers
Memory partitioning and scheduling co-optimization in behavioral synthesis

Proceedings of the International Conference on Computer-Aided Design
Semi-automatic restructuring of offloadable tasks for many-core accelerators

SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Optimizing two-dimensional DMA transfers for scratchpad Based MPSoCs platforms

Microprocessors & Microsystems

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor.