Scalable memory registration for high performance networks using helper threads

Authors:
Dong Li;Kirk W. Cameron;Dimitrios S. Nikolopoulos;Bronis R. de Supinski;Martin Schulz
Affiliations:
Virginia Tech;Virginia Tech;FORTH-ICS and University of Crete, Greece;Lawrence Livermore National Lab;Lawrence Livermore National Lab
Venue:
Proceedings of the 8th ACM International Conference on Computing Frontiers
Year:
2011

Abstract

Remote DMA (RDMA) enables high performance networks to reduce data copying between an application and the operating system (OS). However, RDMA operations in some high performance networks require communication memory to be explicitly registered with the network adapter and pinned by the OS. Memory registration and pinning limit the flexibility of the memory system and reduce the amount of memory that user processes can allocate. These issues become more significant on multicore platforms, since registered memory demand grows linearly with the number of processor cores. In this paper, we propose a new memory registration/deregistration strategy to reduce registered memory on multicore architectures for HPC applications. We hide the cost of dynamic memory management by offloading all dynamic memory registration and deregistration requests to a dedicated memory management helper thread. We investigate design policies and performance implications of the helper thread approach. We evaluate our framework with the NAS parallel benchmarks, for which our registration scheme significantly reduces the registered memory (by 23.62% on average and by up to 49.39%) and avoids memory registration/deregistration costs for reused communication memory. We show that our system enables the execution of problem sizes that could not complete under existing memory registration strategies.
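
The helper-thread idea described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: it makes no real RDMA or pinning calls, and every name in it (RegistrationHelper, submit, demo, the "sendbuf" identifier) is hypothetical. It shows an application thread handing registration and deregistration requests to a dedicated thread through a queue, and a request for an already registered buffer being served without a second registration.

```python
import queue
import threading


class RegistrationHelper:
    """Simulated helper thread that services (de)registration requests.

    Hypothetical sketch: "registering" a buffer only records its id in a
    set; a real system would pin pages and talk to the network adapter.
    """

    def __init__(self):
        self._requests = queue.Queue()
        self.registered = set()   # ids of buffers currently "registered"
        self.register_calls = 0   # number of simulated registrations performed
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        # Runs on the helper thread: drain requests until told to stop.
        while True:
            op, buf_id, done = self._requests.get()
            if op == "stop":
                done.set()
                return
            if op == "register" and buf_id not in self.registered:
                self.register_calls += 1
                self.registered.add(buf_id)
            elif op == "deregister":
                self.registered.discard(buf_id)
            done.set()  # signal the requester that the request was handled

    def submit(self, op, buf_id=None):
        # Called by the application thread; returns immediately with an
        # Event that the caller can wait on just before communicating.
        done = threading.Event()
        self._requests.put((op, buf_id, done))
        return done

    def stop(self):
        self.submit("stop").wait()
        self._thread.join()


def demo():
    helper = RegistrationHelper()
    pending = helper.submit("register", "sendbuf")
    # ... the application could compute here while the helper works ...
    pending.wait()
    helper.submit("register", "sendbuf").wait()    # reused buffer: no new registration
    helper.submit("deregister", "sendbuf").wait()
    helper.stop()
    return helper.register_calls, sorted(helper.registered)
```

In this sketch the application only blocks when it calls wait() on the returned event, which is where overlap between computation and registration would come from. When to register ahead of use and when to deregister are policy questions that the paper investigates and that this sketch does not attempt to model.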