Enhancing the Performance of Tiled Loop Execution onto Clusters Using Memory Mapped Network Interfaces and Pipelined Schedules

  • Authors:
  • Aristidis Sotiropoulos;Georgios Tsoukalas;Nectarios Koziris

  • Venue:
  • IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
  • Year:
  • 2002

Abstract

This paper describes the performance benefits obtained by using enhanced network interfaces to achieve low-latency communication. Our experimental testbed runs tiled nested loops in parallel on a Linux PC cluster with PCI-SCI NICs (Dolphin D330). Tiles necessarily exchange data, and they must also have a large computational grain for their parallel execution to be beneficial. We schedule tiles much more efficiently by exploiting the inherent overlap between the communication and computation phases of successive, atomic tile executions. The applied nonblocking schedule resembles a pipelined datapath in which computation phases are overlapped with communication phases instead of being interleaved with them. We use DMA communication mode to remote-write (send) data to other nodes while the host CPU computes all iterations within each tile. We achieve zero-copy communication through pinned-down physical memory regions for DMA (PCI segments exported to the SCI global space). Results show that enhanced communication features such as DMA transfers, memory-mapped interfaces, and zero-copy mechanisms yield considerably better overall performance than conventional, CPU- and kernel-bound communication primitives.
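
The difference between an interleaved (blocking) schedule and the overlapped, pipelined one can be illustrated with a toy timing model. The sketch below is not the paper's cost model; it assumes a single node that executes a sequence of tiles, a fixed compute time and a fixed send time per tile, one DMA engine that handles one transfer at a time, and enough buffer space that the CPU never waits for a pending send. The function names and parameters are hypothetical and chosen only for illustration.

```python
def blocking_time(num_tiles, compute, send):
    """Interleaved schedule: the CPU computes a tile, then performs the
    send itself before starting the next tile."""
    return num_tiles * (compute + send)


def pipelined_time(num_tiles, compute, send):
    """Overlapped schedule: a DMA engine sends tile i while the CPU
    already computes tile i+1 (assumption: one DMA engine, unlimited
    buffering, fixed per-tile costs)."""
    cpu_free = 0  # time at which the CPU finishes its current tile
    dma_free = 0  # time at which the DMA engine finishes its current send
    for _ in range(num_tiles):
        cpu_free += compute
        # A send starts once the tile's data is ready and the DMA engine is idle.
        dma_free = max(dma_free, cpu_free) + send
    return max(cpu_free, dma_free)


if __name__ == "__main__":
    tiles, compute, send = 4, 3, 1  # arbitrary illustrative values
    print("blocking :", blocking_time(tiles, compute, send))
    print("pipelined:", pipelined_time(tiles, compute, send))
```

In this model the pipelined schedule pays the send cost in full only for the last tile when computation dominates, which is why a large computational grain per tile helps hide communication.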