Remote DMA (RDMA) enables high performance networks to reduce data copying between an application and the operating system (OS). However, RDMA operations in some high performance networks require communication memory to be explicitly registered with the network adapter and pinned by the OS. Memory registration and pinning limit the flexibility of the memory system and reduce the amount of memory that user processes can allocate. These issues become more significant on multicore platforms, since registered memory demand grows linearly with the number of processor cores. In this paper we propose a new memory registration/deregistration strategy that reduces registered memory on multicore architectures for HPC applications. We hide the cost of dynamic memory management by offloading all dynamic memory registration and deregistration requests to a dedicated memory management helper thread. We investigate the design policies and performance implications of the helper thread approach. We evaluate our framework with the NAS parallel benchmarks, for which our registration scheme significantly reduces registered memory (by 23.62% on average and by up to 49.39%) and avoids memory registration/deregistration costs for reused communication memory. We show that our system enables the execution of problem sizes that could not complete under existing memory registration strategies.
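The helper thread idea can be illustrated with a minimal sketch. It is not the paper's implementation: the class `RegistrationHelper`, the callbacks `register_fn` and `deregister_fn`, and the string buffer identifiers are all hypothetical stand-ins, and no real network adapter or pinning call is involved. The sketch only shows the general pattern of queuing registration and deregistration requests to a dedicated thread and skipping registration for a buffer that is still registered.

```python
import queue
import threading


class RegistrationHelper:
    """Toy model: one helper thread owns all (de)registration work."""

    def __init__(self, register_fn, deregister_fn):
        # Hypothetical stand-ins for the adapter's register/deregister calls.
        self._register_fn = register_fn
        self._deregister_fn = deregister_fn
        self._requests = queue.Queue()
        self._registered = {}  # buffer id -> registration handle
        self._lock = threading.Lock()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Serve requests in FIFO order until a "stop" request arrives.
        while True:
            op, buf_id, done = self._requests.get()
            if op == "stop":
                done.set()
                return
            with self._lock:
                if op == "register" and buf_id not in self._registered:
                    self._registered[buf_id] = self._register_fn(buf_id)
                elif op == "deregister" and buf_id in self._registered:
                    self._deregister_fn(self._registered.pop(buf_id))
            done.set()

    def submit(self, op, buf_id):
        """Queue a request; the returned event is set once it is served."""
        done = threading.Event()
        self._requests.put((op, buf_id, done))
        return done

    def is_registered(self, buf_id):
        with self._lock:
            return buf_id in self._registered

    def stop(self):
        self.submit("stop", None).wait()


def demo():
    calls = []  # records every simulated adapter call
    helper = RegistrationHelper(
        register_fn=lambda b: calls.append(("reg", b)) or f"handle-{b}",
        deregister_fn=lambda h: calls.append(("dereg", h)),
    )
    helper.submit("register", "buf0").wait()
    helper.submit("register", "buf0").wait()  # reused buffer: already registered
    still_registered = helper.is_registered("buf0")
    helper.submit("deregister", "buf0").wait()
    helper.stop()
    return calls, still_registered
```

In this sketch the application thread only waits on the returned event when it actually needs the buffer, so in principle it could overlap computation with the helper's work. When to deregister (the policy question the abstract mentions) is left to the caller here, which is an assumption of the example rather than a description of the proposed design.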