Enhancing the Performance of Tiled Loop Execution onto Clusters Using Memory Mapped Network Interfaces and Pipelined Schedules

Authors:
Aristidis Sotiropoulos;Georgios Tsoukalas;Nectarios Koziris
Venue:
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Year:
2002

Abstract

This paper describes the performance benefits of using enhanced network interfaces to achieve low-latency communication. Our experimental testbed executes tiled nested loops in parallel on a Linux PC cluster with PCI-SCI NICs (Dolphin D330). Tiles must exchange data, and they should also have a large computational grain so that their parallel execution is beneficial. We schedule tiles more efficiently by exploiting the inherent overlap between the communication and computation phases of successive, atomic tile executions. The applied nonblocking schedule resembles a pipelined datapath in which computation phases overlap with communication phases instead of being interleaved with them. We use the DMA communication mode to remote-write (send) data to other nodes while the host CPU computes all iterations within each tile. We achieve zero-copy communication through pinned-down physical memory regions for DMA (PCI segments exported to the SCI global space). Results show that enhanced communication features such as DMA transfers, memory-mapped interfaces, and zero-copy mechanisms yield considerably better overall performance than conventional CPU- and kernel-bound communication primitives.
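
The difference between an interleaved (blocking) schedule and an overlapped (pipelined) one can be illustrated with a toy timing model. The sketch below is not the paper's cost model. It assumes a single node that processes a chain of tiles, a fixed compute time and a fixed send time per tile, and a DMA engine that transfers one tile's data at a time while the CPU moves on to the next tile. The function names and parameters are hypothetical.

```python
def blocking_time(n_tiles, t_comp, t_comm):
    """Interleaved schedule: the CPU computes a tile, then performs the send itself."""
    return n_tiles * (t_comp + t_comm)


def overlapped_time(n_tiles, t_comp, t_comm):
    """Pipelined schedule: a DMA engine (assumed) sends tile k while the CPU computes tile k+1."""
    cpu_free = 0.0  # time at which the CPU finishes its current tile
    dma_free = 0.0  # time at which the DMA engine finishes its current transfer
    for _ in range(n_tiles):
        cpu_free += t_comp                 # CPU computes the next tile back to back
        start = max(cpu_free, dma_free)    # send starts once data is ready and DMA is idle
        dma_free = start + t_comm
    return max(cpu_free, dma_free)


if __name__ == "__main__":
    # Illustrative values only: 4 tiles, 3 time units of compute, 1 of communication.
    print(blocking_time(4, 3.0, 1.0))    # 16.0
    print(overlapped_time(4, 3.0, 1.0))  # 13.0
```

In this model the overlapped schedule hides all but the last send when computation dominates, which is why a large computational grain per tile matters. When communication dominates instead, the DMA engine becomes the bottleneck and the total time approaches the sum of the sends.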