Hyperplane Grouping and Pipelined Schedules: How to Execute Tiled Loops Fast on Clusters of SMPs

Authors:
Maria Athanasaki;Aristidis Sotiropoulos;Georgios Tsoukalas;Nectarios Koziris;Panayiotis Tsanakas
Affiliations:
School of Electrical and Computer Engineering, Computing Systems Laboratory, National Technical University of Athens, Zografou, Greece 15773 (all authors)
Venue:
The Journal of Supercomputing
Year:
2005

Abstract

This paper proposes a novel approach for the parallel execution of tiled iteration spaces on a cluster of SMP PC nodes. Each SMP node has multiple CPUs and a single memory-mapped PCI-SCI network interface card. We apply a hyperplane-based grouping transformation to the tiled space, so as to group together independent neighboring tiles and assign them to the same SMP node. In this way, intranode (intragroup) communication is eliminated. Each group is executed atomically inside a node, and nodes exchange data between successive group computations. We schedule groups more efficiently by exploiting the overlap between the communication and computation phases of successive atomic group executions. The resulting non-blocking schedule resembles a pipelined datapath: group computation phases overlap with communication phases instead of being interleaved with them. Our experimental results show that the proposed method outperforms previous approaches based on blocking communication or conventional grouping schemes.
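
The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a 2D tile space with unit tile dependencies (1, 0) and (0, 1), so that tiles on the same hyperplane i + j = k are mutually independent, and it uses a toy timing model in which every group step has a fixed computation cost `t_comp` and communication cost `t_comm`. All function names and parameters here are hypothetical.

```python
from collections import defaultdict


def group_tiles_by_hyperplane(n_rows, n_cols, cpus_per_node):
    """Group tiles of a 2D tile space along the hyperplanes i + j = k.

    Assumption: tile dependencies are (1, 0) and (0, 1), so tiles that share
    the same k never depend on each other. Each group holds at most
    cpus_per_node neighboring tiles of one hyperplane, one tile per CPU.
    """
    hyperplanes = defaultdict(list)
    for i in range(n_rows):
        for j in range(n_cols):
            hyperplanes[i + j].append((i, j))

    groups = []
    for k in sorted(hyperplanes):
        tiles = hyperplanes[k]  # ordered by increasing i, so neighbors are adjacent
        for start in range(0, len(tiles), cpus_per_node):
            groups.append(tiles[start:start + cpus_per_node])
    return groups


def is_independent(group):
    """Return True if no tile in the group directly depends on another member."""
    members = set(group)
    return all(
        (i - 1, j) not in members and (i, j - 1) not in members
        for i, j in group
    )


def blocking_time(steps, t_comp, t_comm):
    """Toy model: computation and communication are interleaved, never overlapped."""
    return steps * (t_comp + t_comm)


def overlapped_time(steps, t_comp, t_comm):
    """Toy model of a two-stage pipeline: after the first step, each further
    step costs only the longer of the two phases."""
    return t_comp + t_comm + (steps - 1) * max(t_comp, t_comm)


if __name__ == "__main__":
    groups = group_tiles_by_hyperplane(3, 3, cpus_per_node=2)
    for g in groups:
        print(g, "independent:", is_independent(g))
    print("blocking:", blocking_time(10, 3, 2))
    print("overlapped:", overlapped_time(10, 3, 2))
```

Under these assumptions the tiles within a group need no intragroup data exchange, and the overlapped schedule approaches `steps * max(t_comp, t_comm)` as the number of steps grows. The paper's actual grouping transformation and cost model may differ from this simplification.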