Tiling is a well-known loop transformation used to reduce communication overhead on distributed-memory machines. Although much theoretical research has addressed the selection of tile shapes that reduce processor idle time, there is no complete approach for automatically parallelizing non-rectangularly tiled iteration spaces, and consequently there are no experimental results that verify previous theoretical work on how tile shape affects the overall completion time of a tiled algorithm. This paper presents a complete end-to-end framework that automatically generates message-passing code for tiled iteration spaces. It handles general parallelepiped tiling transformations and convex iteration spaces. We address the problems of data-parallel code generation efficiently by transforming the initial non-rectangular tile into a rectangular one, so that data distribution and the corresponding communication pattern become simple and straightforward. We have implemented our parallelizing techniques in a tool that automatically generates MPI code, and we ran several benchmarks on a cluster of PCs. Our experimental results show the merit of general parallelepiped tiling transformations and verify previous theoretical work on scheduling-optimal, non-rectangular tile shapes.
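To make the idea of a parallelepiped tile and its rectangular image concrete, the following is a minimal two-dimensional sketch. It is not taken from the paper's tool. It assumes the common formulation in which a tiling is given by a matrix P whose columns are the tile side vectors, with H = P^-1, so that iteration j belongs to tile floor(H j). The function names and the example matrix P are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import floor

# Hypothetical example tiling: the COLUMNS of P are the tile side vectors
# (2, 0) and (1, 2), which gives a skewed (non-rectangular) tile.
P = [[2, 1],
     [0, 2]]

def inverse_2x2(P):
    """Exact inverse of a 2x2 integer matrix (assumed form: H = P^-1)."""
    (a, b), (c, d) = P
    det = a * d - b * c
    if det == 0:
        raise ValueError("tile side vectors must be linearly independent")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

def tile_of(P, j):
    """Index of the tile that contains iteration point j: floor(H j)."""
    H = inverse_2x2(P)
    return tuple(floor(H[r][0] * j[0] + H[r][1] * j[1]) for r in range(2))

def tile_origin(P, t):
    """Iteration-space coordinates of the origin of tile t: P t."""
    return tuple(P[r][0] * t[0] + P[r][1] * t[1] for r in range(2))

def local_coords(P, j):
    """Coordinates of j inside its tile after mapping the parallelepiped
    onto the unit square; each component lies in [0, 1)."""
    H = inverse_2x2(P)
    o = tile_origin(P, tile_of(P, j))
    d = (j[0] - o[0], j[1] - o[1])
    return tuple(H[r][0] * d[0] + H[r][1] * d[1] for r in range(2))

if __name__ == "__main__":
    for j in [(3, 2), (0, 1)]:
        print(j, tile_of(P, j), local_coords(P, j))
```

In this sketch every tile, whatever its skew, maps onto the same unit square, which is one way to see why a rectangular image of the tile makes data distribution and communication patterns easier to describe. Exact rational arithmetic is used so that points on tile boundaries are not misassigned by floating-point rounding.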