Optimizing memory system performance for communication in parallel computers

Authors:
T. Stricker; T. Gross
Affiliations:
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA; School of Computer Science, Carnegie Mellon University, Pittsburgh, PA and Institut fuer Computer Systeme, ETH Zuerich, CH 8092 Zuerich, Switzerland
Venue:
ISCA '95 Proceedings of the 22nd annual international symposium on Computer architecture
Year:
1995

Abstract

Communication in a parallel system frequently involves moving data from the memory of one node to the memory of another; this is the standard communication model employed in message passing systems. Depending on the application, we observe a variety of patterns as part of communication steps, e.g., regular (i.e., blocks of data), strided, or irregular (indexed) memory accesses. The effective speed of these communication steps is determined by the network bandwidth and the memory bandwidth, and measurements on current parallel supercomputers indicate that the performance is limited by the memory bandwidth rather than the network bandwidth.
Current systems provide a wealth of options to perform communication, and a compiler or user is faced with the difficulty of finding the communication operations that best use the available memory and network bandwidth. This paper provides a framework to evaluate different solutions for inter-node communication and presents the copy-transfer model; this model captures the contributions of the memory system to inter-node communication. We demonstrate the usefulness of this simple model by applying it to two commercial parallel systems, the Cray T3D and the Intel Paragon.
In particular, we identify two methods to transfer data between nodes in these two machines. In buffer-packing transfers, a contiguous block of data is transferred across the network. If the data are not stored contiguously, they are copied to (gathering) or from (scattering) buffers in local memory before and after the transfer. Chained transfers perform gathering, transfer and scattering in one step, reading the data elements with some non-sequential pattern and immediately transferring them on to the destination.
Our model and measurements indicate that chaining of the gather, transfer, and scatter operations results in better performance than buffer packing for many important access patterns.
Most standard message passing libraries (like MPI, PVM or NX) force the parallelizing compiler (or the programmer) to employ the buffer-packing communication operations. However, the addition of hardware support dedicated to communication (e.g., DMAs, line-transfer units) now gives the compiler a wider range of options.
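
The difference between the two transfer methods described in the abstract can be illustrated with a small single-process sketch. It is not code from the paper: the two Python lists stand in for the memories of two nodes, and all function names are hypothetical. The two rate helpers encode an assumed composition rule (steps executed one after another add their per-byte times, steps chained in a pipeline are bounded by the slowest stage), and the rates passed to them in the usage lines are made-up numbers, not measurements from the Cray T3D or Intel Paragon.

```python
def buffer_packing_transfer(src, src_idx, dst, dst_idx):
    """Gather, contiguous transfer, scatter as three separate passes.

    Returns the number of element copies performed.
    """
    send_buf = [src[i] for i in src_idx]      # gather into a contiguous local buffer
    recv_buf = list(send_buf)                 # stand-in for the contiguous network transfer
    for j, value in zip(dst_idx, recv_buf):   # scatter from the receive buffer
        dst[j] = value
    return 3 * len(send_buf)


def chained_transfer(src, src_idx, dst, dst_idx):
    """Read each element with the source pattern and store it directly
    with the destination pattern, with no intermediate buffers.

    Returns the number of element copies performed.
    """
    count = 0
    for i, j in zip(src_idx, dst_idx):
        dst[j] = src[i]
        count += 1
    return count


def sequential_rate(*rates):
    """Assumed rule: steps run one after another, so per-byte times add."""
    return 1.0 / sum(1.0 / r for r in rates)


def chained_rate(*rates):
    """Assumed rule: pipelined steps are limited by the slowest stage."""
    return min(rates)


# Strided read on the source node, reversed (indexed) write on the destination.
src = list(range(16))
src_idx = list(range(0, 16, 4))   # elements 0, 4, 8, 12
dst_idx = [3, 2, 1, 0]

dst_packed = [None] * 4
dst_chained = [None] * 4
copies_packed = buffer_packing_transfer(src, src_idx, dst_packed, dst_idx)
copies_chained = chained_transfer(src, src_idx, dst_chained, dst_idx)

# Hypothetical stage rates in MB/s: gather, network transfer, scatter.
rate_packed = sequential_rate(60.0, 120.0, 60.0)
rate_chained = chained_rate(60.0, 120.0, 60.0)
```

Both functions leave the same data at the destination; the buffer-packing version touches every element three times, while the chained version touches it once. Under the assumed composition rule, that extra memory traffic is what lowers the effective rate of buffer packing relative to chaining.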