This paper describes the performance benefits of using enhanced network interfaces to achieve low-latency communication. Our experimental testbed is the parallel execution of tiled nested loops on a Linux PC cluster with PCI-SCI NICs (Dolphin D330). Tiles necessarily exchange data, and they should also have a large computational grain so that their parallel execution is beneficial. We schedule tiles more efficiently by exploiting the inherent overlap between the communication and computation phases of successive, atomic tile executions. The resulting nonblocking schedule resembles a pipelined datapath in which computation phases are overlapped with communication phases instead of being interleaved with them. We use DMA communication mode to remote-write (send) data to other nodes while the host CPU computes all iterations within each tile. We achieve zero-copy communication through pinned-down physical memory regions for DMA (PCI segments exported to the SCI global space). Results show that enhanced communication features such as DMA transfers, memory-mapped interfaces and zero-copy mechanisms yield considerably better overall performance than conventional, CPU- and kernel-bound communication primitives.
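The difference between an interleaved and an overlapped schedule can be illustrated with a toy cost model. This is only a sketch under simplifying assumptions, not the schedule or cost model of the paper: every tile is assumed to take the same computation time `t_comp` and the same transfer time `t_comm`, a single chain of `steps` tiles is considered, and the function names and sample values are hypothetical.

```python
def interleaved_time(steps: int, t_comp: float, t_comm: float) -> float:
    """Blocking schedule: each tile is computed, then its data is sent,
    so the two phases are serialized on every step."""
    return steps * (t_comp + t_comm)


def overlapped_time(steps: int, t_comp: float, t_comm: float) -> float:
    """Pipelined schedule (assumed model): the transfer of tile k runs
    (e.g. via DMA) while the CPU computes tile k + 1."""
    if steps == 0:
        return 0
    # First computation cannot be hidden, and neither can the last transfer.
    # In between, each step costs whichever phase is longer.
    return t_comp + (steps - 1) * max(t_comp, t_comm) + t_comm


if __name__ == "__main__":
    steps, t_comp, t_comm = 10, 4, 3  # hypothetical values, arbitrary time units
    print("interleaved:", interleaved_time(steps, t_comp, t_comm))
    print("overlapped: ", overlapped_time(steps, t_comp, t_comm))
```

In this model the overlapped schedule approaches `steps * max(t_comp, t_comm)` for long chains, which is why a large computational grain per tile helps hide the communication cost.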