LogP: towards a realistic model of parallel computation
PPOPP '93 Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming
Effects of communication latency, overhead, and bandwidth in a cluster architecture
Proceedings of the 24th annual international symposium on Computer architecture
MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems
IEEE Transactions on Parallel and Distributed Systems
MPI-StarT: delivering network performance to numerical applications
SC '98 Proceedings of the 1998 ACM/IEEE conference on Supercomputing
Experiences with VI communication for database storage
ISCA '02 Proceedings of the 29th annual international symposium on Computer architecture
The Virtual Interface Architecture
IEEE Micro
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
A Strategy to Compute the InfiniBand Arbitration Tables
IPDPS '02 Proceedings of the 16th International Parallel and Distributed Processing Symposium
Structure and Performance of the Direct Access File System
ATEC '02 Proceedings of the General Track of the annual conference on USENIX Annual Technical Conference
Ultra-high performance communication with MPI and the Sun fire™ link interconnect
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Impact of On-Demand Connection Management in MPI over VIA
CLUSTER '02 Proceedings of the IEEE International Conference on Cluster Computing
Efficient Collective Operations Using Remote Memory Operations on VIA-Based Clusters
IPDPS '03 Proceedings of the 17th International Symposium on Parallel and Distributed Processing
User-Level Communication in Cluster-Based Servers
HPCA '02 Proceedings of the 8th International Symposium on High-Performance Computer Architecture
Pin-down Cache: A Virtual Memory Management Technique for Zero-copy Communication
IPPS '98 Proceedings of the 12th. International Parallel Processing Symposium on International Parallel Processing Symposium
Building Multirail InfiniBand Clusters: MPI-Level Design and Performance Evaluation
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
PRESS: A Clustered Server Based on User-Level Communication
IEEE Transactions on Parallel and Distributed Systems
Designing Efficient Java Communications on Clusters
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 5 - Volume 06
International Journal of High Performance Computing Applications
International Journal of High Performance Computing Applications
Platform Overlays: enabling in-network stream processing in large-scale distributed applications
NOSSDAV '05 Proceedings of the international workshop on Network and operating systems support for digital audio and video
Design and Implementation of Multiple Fault-Tolerant MPI over Myrinet (M^3)
SC '05 Proceedings of the 2005 ACM/IEEE conference on Supercomputing
Memory efficient parallel matrix multiplication operation for irregular problems
Proceedings of the 3rd conference on Computing frontiers
RDMA control support for fine-grain parallel computations
Journal of Systems Architecture: the EUROMICRO Journal - Special issue: Parallel, distributed and network-based processing
A case for high performance computing with virtual machines
Proceedings of the 20th annual international conference on Supercomputing
Scanning workstation memory for malicious codes using dedicated coprocessors
Proceedings of the 44th annual Southeast regional conference
Scalable algorithms for molecular dynamics simulations on commodity clusters
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
Proceedings of the 2006 ACM/IEEE conference on Supercomputing
The Journal of Supercomputing
Coprocessor design to support MPI primitives in configurable multiprocessors
Integration, the VLSI Journal
Nomad: migrating OS-bypass networks in virtual machines
Proceedings of the 3rd international conference on Virtual execution environments
High performance MPI design using unreliable datagram for ultra-scale InfiniBand clusters
Proceedings of the 21st annual international conference on Supercomputing
High-performance ethernet-based communications for future multi-core processors
Proceedings of the 2007 ACM/IEEE conference on Supercomputing
Performance implications of virtualizing multicore cluster machines
Proceedings of the 2nd workshop on System-level virtualization for high performance computing
Proceedings of the 23rd international conference on Supercomputing
RSP '09 Proceedings of the 2009 IEEE/IFIP International Symposium on Rapid System Prototyping
Infiniband scalability in open MPI
IPDPS'06 Proceedings of the 20th international conference on Parallel and distributed processing
Unifying UPC and MPI runtimes: experience with MVAPICH
Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model
High performance RDMA protocols in HPC
EuroPVM/MPI'06 Proceedings of the 13th European PVM/MPI User's Group conference on Recent advances in parallel virtual machine and message passing interface
SHIELD: a fault-tolerant MPI for an infiniband cluster
HPCC'06 Proceedings of the Second international conference on High Performance Computing and Communications
Open MPI: a flexible high performance MPI
PPAM'05 Proceedings of the 6th international conference on Parallel Processing and Applied Mathematics
Analysis of the memory registration process in the mellanox infiniband software stack
Euro-Par'06 Proceedings of the 12th international conference on Parallel Processing
Tolerating message latency through the early release of blocked receives
Euro-Par'05 Proceedings of the 11th international Euro-Par conference on Parallel Processing
Architecture and early performance of the new IBM HPS fabric and adapter
HiPC'04 Proceedings of the 11th international conference on High Performance Computing
High-performance RMA-based broadcast on the intel SCC
Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures
Barely alive memory servers: Keeping data active in a low-power state
ACM Journal on Emerging Technologies in Computing Systems (JETC)
RDMA in the SiCortex cluster systems
PVM/MPI'07 Proceedings of the 14th European conference on Recent Advances in Parallel Virtual Machine and Message Passing Interface
An efficient kernel-level blocking MPI implementation
EuroMPI'12 Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface
Proceedings of the 20th European MPI Users' Group Meeting
Hybrid MPI: efficient message passing for multi-core systems
SC '13 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
Although InfiniBand Architecture is relatively new in the high performance computing area, it offers many features which help us to improve the performance of communication subsystems. One of these features is Remote Direct Memory Access (RDMA) operations. In this paper, we propose a new design of MPI over InfiniBand which brings the benefit of RDMA to not only large messages, but also small and control messages. We also achieve better scalability by exploiting application communication pattern and combining send/receive operations with RDMA operations. Our RDMA-based MPI implementation currently delivers a latency of 6.8 microseconds for small messages and a peak bandwidth of 871 Million Bytes (831 Mega Bytes) per second. Performance evaluation at the MPI level shows that for small messages, our RDMA-based design can reduce the latency by 24%, increase the bandwidth by over 104%, and reduce the host overhead by up to 22%. For large messages, we improve performance by reducing the time for transferring control messages. We have also shown that our new design is beneficial to MPI collective communication and NAS Parallel Benchmarks.