A Resource Optimized Remote-Memory-Access Architecture for Low-latency Communication

Authors:
Mondrian Nussle;Martin Scherer;Ulrich Bruning
Venue:
ICPP '09 Proceedings of the 2009 International Conference on Parallel Processing
Year:
2009

Citing 0
Cited 3

A cluster computer performance predictor for memory scheduling
ICA3PP'11 Proceedings of the 11th international conference on Algorithms and architectures for parallel processing - Volume Part II

A cost-effective heuristic to schedule local and remote memory in cluster computers
The Journal of Supercomputing

On the scalability of the clusters-booster concept: a critical assessment of the DEEP architecture
Proceedings of the Future HPC Systems: the Challenges of Power-Constrained Performance

Abstract

This paper introduces a new, highly optimized architecture for remote memory access (RMA). RMA, built on put and get operations, is a one-sided communication mechanism that is important, among other uses, in current and upcoming Partitioned Global Address Space (PGAS) systems. The work describes a virtualized hardware unit that is resource-optimized and exhibits high overlap, processor offload, and very good latency characteristics. A single HyperTransport packet, caused by one CPU instruction, is sufficient to start an RMA operation, which reduces latency to an absolute minimum. In addition to the basic architecture, an implementation in FPGA technology is presented together with an evaluation of the target ASIC implementation. The current system sustains more than 4.9 million transactions per second on the FPGA and exhibits an end-to-end latency of 1.2 μs for an 8-byte put operation. Both values are limited by the FPGA technology used for the prototype implementation. An estimate of the performance reachable in ASIC technology suggests that application-to-application latencies of less than 500 ns are feasible.
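
To illustrate what "one-sided" means for the put and get operations mentioned in the abstract, here is a minimal software sketch. It is not the hardware unit the paper describes: the names `Node`, `Fabric`, `put`, and `get` are hypothetical, and the model only assumes that each node owns one partition of a global address space that remote initiators may read or write without any matching call on the target.

```python
class Node:
    """One node's locally owned memory partition (hypothetical model)."""

    def __init__(self, size: int) -> None:
        self.memory = bytearray(size)


class Fabric:
    """Toy stand-in for the interconnect; it performs the access on the
    target node's memory so the target never has to post a receive."""

    def __init__(self, num_nodes: int, partition_size: int) -> None:
        self.nodes = [Node(partition_size) for _ in range(num_nodes)]

    def _check(self, target: int, offset: int, length: int) -> bytearray:
        # Reject accesses that fall outside the target's partition.
        mem = self.nodes[target].memory
        if offset < 0 or length < 0 or offset + length > len(mem):
            raise IndexError("access outside the target partition")
        return mem

    def put(self, target: int, offset: int, data: bytes) -> None:
        # One-sided write: only the initiator calls anything.
        mem = self._check(target, offset, len(data))
        mem[offset:offset + len(data)] = data

    def get(self, target: int, offset: int, length: int) -> bytes:
        # One-sided read: a copy of the remote bytes is returned.
        mem = self._check(target, offset, length)
        return bytes(mem[offset:offset + length])


# An 8-byte put, the message size used for the latency figure in the abstract.
fabric = Fabric(num_nodes=2, partition_size=64)
payload = (0xDEADBEEF).to_bytes(8, "little")
fabric.put(target=1, offset=16, data=payload)
readback = fabric.get(target=1, offset=16, length=8)
```

In this sketch the target node runs no code at all during the transfer, which is the property that lets one-sided communication offload the remote processor. The real architecture achieves this in hardware and is not modeled here.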