A pipelined schedule to minimize completion time for loop tiling with computation and communication overlapping

Authors:
N. Koziris;A. Sotiropoulos;G. Goumas
Affiliations:
Computing Systems Laboratory, Department of Electrical and Computer Engineering, National Technical University of Athens, Zografou Campus, 15773, Zografou, Greece (all authors)
Venue:
Journal of Parallel and Distributed Computing
Year:
2003

Abstract

This paper proposes a new method for minimizing the execution time of nested for-loops using a tiling transformation. We are interested not only in choosing tile size and shape according to the required communication-to-computation ratio, but also in the overall completion time. We select a time hyperplane that executes tiles much more efficiently by exploiting the inherent overlap between the communication and computation phases of successive, atomic tile executions. We assign tiles to processors according to the tile space boundaries, thereby taking the iteration space bounds into account. Our schedule considerably reduces overall completion time, under the assumption that some part of every communication phase can be efficiently overlapped with atomic, pure tile computations. The overall schedule resembles a pipelined datapath in which computations are no longer interleaved with sends and receives to nonlocal processors. We survey the application of our schedule to modern communication architectures. We ran two sets of experiments, one using MPI primitives over FastEthernet and one using the SISCI API over an SCI network. In both cases, the total completion time is significantly reduced.
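
The effect of overlapping described in the abstract can be illustrated with a toy cost model. The sketch below is not the paper's schedule or its cost formulas. It assumes a chain of `P` processors, each executing `K` tiles in order, where tile `k` on processor `p` needs the data produced by tile `k` on processor `p - 1`. The function names, the parameters `t_comp` and `t_comm`, and the simplification that a whole message (rather than only part of it) can be overlapped are all assumptions made for illustration.

```python
def serialized_time(P, K, t_comp, t_comm):
    """Completion time when the CPU computes a tile and then performs a
    blocking send, so communication is interleaved with computation.
    Toy model (assumption): the receiver has the data when the send ends."""
    sent = [[0.0] * K for _ in range(P)]  # time tile (p, k)'s data is delivered
    t = 0.0
    for p in range(P):
        cpu_free = 0.0
        for k in range(K):
            ready = sent[p - 1][k] if p > 0 else 0.0
            t = max(cpu_free, ready) + t_comp   # wait for input, then compute
            if p < P - 1:
                t += t_comm                     # CPU is busy during the send
            sent[p][k] = t
            cpu_free = t
    return t


def overlapped_time(P, K, t_comp, t_comm):
    """Completion time when messages travel on a separate link (for example
    a DMA-capable NIC) while the CPU already computes the next tile.
    Toy model (assumption): the entire message is overlapped."""
    arrive = [[0.0] * K for _ in range(P)]  # time tile (p, k)'s data arrives
    end = 0.0
    for p in range(P):
        cpu_free = 0.0
        link_free = 0.0
        for k in range(K):
            ready = arrive[p - 1][k] if p > 0 else 0.0
            end = max(cpu_free, ready) + t_comp  # CPU only computes
            cpu_free = end
            if p < P - 1:
                # The link sends messages one at a time, off the CPU.
                link_free = max(end, link_free) + t_comm
                arrive[p][k] = link_free
    return end


if __name__ == "__main__":
    P, K, t_comp, t_comm = 4, 16, 2.0, 1.0
    print("serialized:", serialized_time(P, K, t_comp, t_comm))
    print("overlapped:", overlapped_time(P, K, t_comp, t_comm))
```

In this model the overlapped schedule is never slower than the serialized one, and the two coincide when `t_comm` is zero. How much is gained in practice depends on how much of each communication phase the hardware and messaging layer can actually take off the CPU, which is the assumption the abstract states.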