Pipelined scheduling of tiled nested loops onto clusters of SMPs using memory mapped network interfaces

Authors:
Maria Athanasaki;Aristidis Sotiropoulos;Georgios Tsoukalas;Nectarios Koziris
Affiliations:
National Technical University of Athens, Computing Systems Laboratory (all authors)
Venue:
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Year:
2002

Abstract

This paper describes the performance benefits of using enhanced network interfaces to achieve low-latency communication. We present a novel pipelined scheduling approach that uses the DMA communication mode to send data to other nodes while the CPUs perform calculations. We also use zero-copy communication through pinned-down physical memory regions provided by the NIC's driver modules. Our testbed is the parallel execution of tiled nested loops on a cluster of SMP nodes, each with a single PCI-SCI NIC. To schedule tiles, we apply a hyperplane-based grouping transformation to the tiled space, which groups independent neighboring tiles together and assigns them to the same SMP node. Experimental evaluation shows that memory-mapped NICs with enhanced communication features enable a more advanced pipelined (overlapping) schedule, which considerably improves performance over an ordinary blocking schedule implemented with conventional CPU- and kernel-bound communication primitives.
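The difference between a blocking and a pipelined (overlapping) schedule can be illustrated with a toy cost model. The sketch below is not the paper's analytical model: it assumes a single chain of tile steps, a fixed per-step computation time and a fixed per-step communication time, and that DMA lets one step's transfer proceed entirely in parallel with the next step's computation. The function names and parameters are hypothetical.

```python
def blocking_time(steps, compute, comm):
    """Total time when every step computes and then communicates,
    with the CPU stalled during communication (no overlap)."""
    return steps * (compute + comm)


def pipelined_time(steps, compute, comm):
    """Total time under the simplifying assumption that the transfer of
    step i overlaps fully with the computation of step i+1.

    The first computation and the last transfer cannot be hidden; each of
    the remaining (steps - 1) slots costs the larger of the two times.
    """
    if steps < 1:
        return 0
    return compute + (steps - 1) * max(compute, comm) + comm


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper.
    steps, compute, comm = 10, 3, 2
    print("blocking :", blocking_time(steps, compute, comm))
    print("pipelined:", pipelined_time(steps, compute, comm))
```

Under these assumptions the pipelined total never exceeds the blocking total, and the gap grows as computation and communication times become comparable.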