To leverage high-speed interconnects such as InfiniBand, it is important to minimize communication overhead. The most significant overhead is the registration of communication memory. In this paper, we present our analysis of the memory registration process inside the Mellanox InfiniBand driver and discuss possible ways to alleviate this bottleneck. We evaluate and characterize the most time-consuming parts of the execution path of the memory registration function using the Read Time Stamp Counter (RDTSC) instruction. We present measurements on AMD Opteron and Intel Xeon systems with different types of Host Channel Adapters for PCI-X and PCI-Express. Finally, we conclude with first results using Linux hugepage support to shorten the time needed to register a memory region.
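The RDTSC-based timing approach mentioned above can be illustrated with a minimal sketch. It is not the instrumentation used in the paper and does not touch the Mellanox driver. As an assumption made purely for illustration, it times `mlock()`, which pins pages in user space, as a stand-in for the pinning step of memory registration. The names `read_ticks` and `time_pinning` and the 4096-byte alignment are hypothetical choices, and the non-x86 fallback reports nanoseconds rather than cycles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the CPU time stamp counter (RDTSC) through the compiler intrinsic. */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* Fallback for non-x86 builds: monotonic clock in nanoseconds, not cycles. */
static inline uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/*
 * Hypothetical helper: measure how many ticks mlock() takes to pin
 * `bytes` bytes. mlock() is only a stand-in (an assumption) for the
 * page-pinning part of memory registration.
 * Returns 0 on success and stores the elapsed ticks in *ticks_out;
 * returns -1 if allocation or pinning fails.
 */
int time_pinning(size_t bytes, uint64_t *ticks_out) {
    assert(ticks_out != NULL);

    void *buf = NULL;
    /* 4096 is an assumed page size; real code could query sysconf(). */
    if (posix_memalign(&buf, 4096, bytes) != 0) {
        return -1;
    }
    memset(buf, 0, bytes); /* touch the pages before the timed section */

    uint64_t start = read_ticks();
    int rc = mlock(buf, bytes);
    uint64_t end = read_ticks();

    if (rc == 0) {
        munlock(buf, bytes);
    }
    free(buf);

    if (rc != 0) {
        return -1; /* e.g. RLIMIT_MEMLOCK is too low */
    }
    *ticks_out = end - start;
    return 0;
}
```

A caller would invoke `time_pinning(16 * 4096, &ticks)` and repeat the call for several buffer sizes to see how the cost scales. A single reading is noisy, so repeated runs are advisable, and `mlock()` may fail when the locked-memory limit of the process is small.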