A bridging model for parallel computation
Communications of the ACM
A message passing coprocessor for distributed memory multicomputers
Proceedings of the 1990 ACM/IEEE conference on Supercomputing
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
PVM: Parallel virtual machine: a users' guide and tutorial for networked parallel computing
Effects of communication latency, overhead, and bandwidth in a cluster architecture
Proceedings of the 24th annual international symposium on Computer architecture
BSPlib: The BSP programming library
Parallel Computing
Route packets, not wires: on-chip inteconnection networks
Proceedings of the 38th annual Design Automation Conference
Advanced topics in MPI programming
Beowulf cluster computing with Linux
An overview of the BlueGene/L Supercomputer
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
The Paderborn University BSP (PUB) library
Parallel Computing
High performance RDMA-based MPI implementation over InfiniBand
ICS '03 Proceedings of the 17th annual international conference on Supercomputing
GASNet Specification, v1.1
FPGAs vs. CPUs: trends in peak floating-point performance
FPGA '04 Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays
Evaluating support for global address space languages on the Cray X1
Proceedings of the 18th annual international conference on Supercomputing
SCI Networking for Shared-Memory Computing in UPC: Blueprints of the GASNet SCI Conduit
LCN '04 Proceedings of the 29th Annual IEEE International Conference on Local Computer Networks
Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
A Hardware Acceleration Unit for MPI Queue Processing
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01
Enhancing NIC Performance for MPI using Processing-in-Memory
IPDPS '05 Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 9 - Volume 10
BEE2: A High-End Reconfigurable Computing System
IEEE Design & Test
Concurrency and Computation: Practice & Experience
Concurrency and Computation: Practice & Experience
Generation of embedded hardware/software from systemC
EURASIP Journal on Embedded Systems
Hi-index | 0.00 |
The Message Passing Interface (MPI) is a widely used standard for interprocessor communications in parallel computers and PC clusters. Its functions are normally implemented in software due to their enormity and complexity, thus resulting in large communication latencies. Limited hardware support for MPI is sometimes available in expensive systems. Reconfigurable computing has recently reached rewarding levels that enable the embedding of programmable parallel systems of respectable size inside one or more Field-Programmable Gate Arrays (FPGAs). Nevertheless, specialized components must be built to support interprocessor communications in these FPGA-based designs, and the resulting code may be difficult to port to other reconfigurable platforms. In addition, performance comparison with conventional parallel computers and PC clusters is very cumbersome or impossible since the latter often employ MPI or similar communication libraries. The introduction of a hardware design to implement directly MPI primitives in configurable multiprocessor computing creates a framework for efficient parallel code development involving data exchanges independently of the underlying hardware implementation. This process also supports the portability of MPI-based code developed for more conventional platforms. This paper takes advantage of the effectiveness and efficiency of one-sided Remote Memory Access (RMA) communications, and presents the design and evaluation of a coprocessor that implements a set of MPI primitives for RMA. These primitives form a universal and orthogonal set that can be used to implement any other MPI function. To evaluate the coprocessor, a router of low latency was designed as well to enable the direct interconnection of several coprocessors in cluster-on-a-chip systems. Experimental results justify the implementation of the MPI primitives in hardware to support parallel programming in reconfigurable computing. Under continuous traffic, results for a Xilinx XC2V6000 FPGA show that the average transmission time per 32-bit word is about 1.35 clock cycles. Although other computing platforms, such as PC clusters, could benefit as well from our design methodology, our focus is exclusively reconfigurable multiprocessing that has recently received tremendous attention in academia and industry.