Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays
IEEE Transactions on Computers
Automatic translation of FORTRAN programs to vector form
ACM Transactions on Programming Languages and Systems (TOPLAS)
The derivation of systolic implementations
Acta Informatica
Solving problems on concurrent processors. Vol. 1: General techniques and regular problems
Compiler Optimizations for Enhancing Parallelism and Their Impact on Architecture Design
IEEE Transactions on Computers - Special issue on architectural support for programming languages and operating systems
Synthesizing Linear Array Algorithms from Nested FOR Loop Algorithms
IEEE Transactions on Computers
POPL '88 Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
Supercompilers for parallel and vector computers
On high-speed computing with a programmable linear array
The Journal of Supercomputing
Generating explicit communication from shared-memory program references
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
The data alignment phase in compiling programs for distributed-memory machines
Journal of Parallel and Distributed Computing
Compiling Fortran D for MIMD distributed-memory machines
Communications of the ACM
A methodology for high-level synthesis of communication on multicomputers
ICS '92 Proceedings of the 6th international conference on Supercomputing
Automatic array alignment in data-parallel programs
POPL '93 Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages
The high performance Fortran handbook
Communication-free hyperplane partitioning of nested loops
Journal of Parallel and Distributed Computing
Integration, the VLSI Journal
Techniques for compiling programs on distributed memory multicomputers
Parallel Computing
Improving data locality with loop transformations
ACM Transactions on Programming Languages and Systems (TOPLAS)
Automatic data layout for distributed memory machines
Efficient Algorithms for Data Distribution on Distributed Memory Parallel Computers
IEEE Transactions on Parallel and Distributed Systems
Communication-minimal tiling of uniform dependence loops
Journal of Parallel and Distributed Computing
Optimization and transformation techniques for high performance Fortran
Maximizing parallelism and minimizing synchronization with affine partitions
Parallel Computing - Special issues on languages and compilers for parallel computers
Automatic data layout for distributed-memory machines
ACM Transactions on Programming Languages and Systems (TOPLAS)
An affine partitioning algorithm to maximize parallelism and minimize communication
ICS '99 Proceedings of the 13th international conference on Supercomputing
Selecting tile shape for minimal execution time
Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures
High performance Fortran compilation techniques for parallelizing scientific codes
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
A Systolic Array Parallelizing Compiler
Advanced Computer Architecture: Parallelism, Scalability, Programmability
High Performance Compilers for Parallel Computing
The Generation of a Class of Multipliers: Synthesizing Highly Parallel Algorithms in VLSI
IEEE Transactions on Computers
Distributed Memory Compiler Design For Sparse Problems
IEEE Transactions on Computers
Mapping Nested Loop Algorithms into Multidimensional Systolic Arrays
IEEE Transactions on Parallel and Distributed Systems
Compiling Communication-Efficient Programs for Massively Parallel Machines
IEEE Transactions on Parallel and Distributed Systems
A Loop Transformation Theory and an Algorithm to Maximize Parallelism
IEEE Transactions on Parallel and Distributed Systems
Compile-Time Techniques for Data Distribution in Distributed Memory Machines
IEEE Transactions on Parallel and Distributed Systems
On Time Mapping of Uniform Dependence Algorithms into Lower Dimensional Processor Arrays
IEEE Transactions on Parallel and Distributed Systems
Compiling for Distributed Memory Architectures
IEEE Transactions on Parallel and Distributed Systems
Communication-Free Data Allocation Techniques for Parallelizing Compilers on Multicomputers
IEEE Transactions on Parallel and Distributed Systems
On Supernode Transformation with Minimized Total Running Time
IEEE Transactions on Parallel and Distributed Systems
On Privatization of Variables for Data-Parallel Execution
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
LCPC '96 Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing
Statement-Level Communication-Free Partitioning Techniques for Parallelizing Compilers
LCPC '96 Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing
Array Distribution in Data-Parallel Programs
LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
Communication-Free Parallelization via Affine Transformations
LCPC '94 Proceedings of the 7th International Workshop on Languages and Compilers for Parallel Computing
Data Relation Vectors: A New Abstraction for Data Optimizations
PACT '00 Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques
A systolic array optimizing compiler
Optimal communication primitives and graph embeddings on hypercubes
Automatic generation of systolic programs from nested loops
Automatic data and computation mapping for distributed-memory machines
Compiler techniques for optimizing communication and data distribution for distributed-memory multicomputers
Automatic computation and data decomposition for multiprocessors
Research note: Modeling distributed data representation and its effect on parallel data accesses
Journal of Parallel and Distributed Computing - Special issue: Design and performance of networks for super-, cluster-, and grid-computing: Part I
Choosing colors for geometric graphs via color space embeddings
GD'06 Proceedings of the 14th international conference on Graph drawing
Inferring arbitrary distributions for data and computation
Proceedings of the ACM international conference companion on Object oriented programming systems languages and applications companion
Partitioning applications for hybrid and federated clouds
CASCON '12 Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research
To exploit parallelism on shared-memory parallel computers (SMPCs), it is natural to focus on decomposing the computation, mainly by distributing the iterations of nested Do-loops. In contrast, on distributed-memory parallel computers (DMPCs), both the decomposition of computation and the distribution of data must be handled, in order to balance the computation load and to minimize data migration. We propose, and validate experimentally, a method that treats computation and data decomposition synergistically so as to minimize the overall execution time on DMPCs. The method is based on a number of novel techniques, also presented in this article. The core idea is to rank the "importance" of the data arrays in a program and to designate some of them as dominant; the intuition is that the dominant arrays are the ones whose migration would be the most expensive. Exploiting the correspondence between iteration-space mapping vectors and the distributed dimensions of the dominant data array in each nested Do-loop, we design algorithms that determine data and computation decompositions simultaneously. Given a data distribution, the computation decomposition for each nested Do-loop is determined by either the "owner computes" rule or the "owner stores" rule with respect to the dominant data array. If all temporal dependence relations across iteration partitions are regular, we apply tiling to enable pipelining and the overlapping of computation and communication. However, to use tiling on DMPCs, we had to extend existing techniques for determining tiling vectors and tile sizes, which were originally suited only to SMPCs. The overall method is illustrated on programs for the 2D heat equation, Gaussian elimination with pivoting, and the 2D fast Fourier transform, on a linear processor array and on a 2D processor grid.
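The "owner computes" rule mentioned above can be illustrated with a minimal sketch (the function names and the block distribution are illustrative assumptions, not the paper's actual algorithms): the dominant array's distributed dimension is block-partitioned over the processors, and each processor then executes only those loop iterations whose left-hand-side element it owns.

```python
def block_owner(i, n, p_count):
    """Owner of element i under a block distribution of n elements
    over p_count processors (ceiling-sized blocks)."""
    block = (n + p_count - 1) // p_count  # ceiling division
    return i // block

def my_iterations(n, p, p_count):
    """Iterations of `for i in range(n): a[i] = f(...)` that the
    owner-computes rule assigns to processor p, assuming a[] is the
    dominant array and its (only) dimension is block-distributed."""
    return [i for i in range(n) if block_owner(i, n, p_count) == p]

# Example: a 10-iteration loop over 4 processors (block size 3).
parts = [my_iterations(10, p, 4) for p in range(4)]
# Every iteration is executed by exactly one processor, and each
# processor writes only array elements it owns, so no write migrates.
```

Under the "owner stores" variant described in the abstract, the same guard would instead be driven by the array element that receives the final stored value; the partitioning machinery is analogous.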