Coprocessor design to support MPI primitives in configurable multiprocessors

Authors:
Sotirios G. Ziavras; Alexandros V. Gerbessiotis; Rohan Bafna
Affiliations:
Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102, USA; Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA; Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102, USA
Venue:
Integration, the VLSI Journal
Year:
2007

Abstract

The Message Passing Interface (MPI) is a widely used standard for interprocessor communication in parallel computers and PC clusters. Its functions are normally implemented in software because of their size and complexity, which results in large communication latencies. Limited hardware support for MPI is sometimes available in expensive systems. Reconfigurable computing has recently matured to the point where programmable parallel systems of respectable size can be embedded in one or more Field-Programmable Gate Arrays (FPGAs). Nevertheless, specialized components must be built to support interprocessor communication in these FPGA-based designs, and the resulting code may be difficult to port to other reconfigurable platforms. In addition, comparing performance with conventional parallel computers and PC clusters is cumbersome or impossible, since the latter often employ MPI or similar communication libraries. A hardware design that directly implements MPI primitives in configurable multiprocessors creates a framework for efficient parallel code development involving data exchanges, independent of the underlying hardware implementation. It also supports the portability of MPI-based code developed for more conventional platforms. This paper takes advantage of the effectiveness and efficiency of one-sided Remote Memory Access (RMA) communications, and presents the design and evaluation of a coprocessor that implements a set of MPI primitives for RMA. These primitives form a universal and orthogonal set that can be used to implement any other MPI function. To evaluate the coprocessor, a low-latency router was also designed to enable the direct interconnection of several coprocessors in cluster-on-a-chip systems. Experimental results justify the implementation of the MPI primitives in hardware to support parallel programming in reconfigurable computing. Under continuous traffic, results for a Xilinx XC2V6000 FPGA show that the average transmission time per 32-bit word is about 1.35 clock cycles. Although other computing platforms, such as PC clusters, could also benefit from our design methodology, our focus is exclusively on reconfigurable multiprocessing, which has recently received tremendous attention in academia and industry.
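
To make the one-sided RMA idea concrete, the following is a minimal software sketch and not the paper's hardware design. It models each processor as a list of 32-bit words (its exposed memory window) and shows that a put or get involves only the initiating side. The names `Fabric`, `put`, `get`, and `transfer_cycles` are hypothetical and chosen for illustration. The 1.35 cycles-per-word default is the average reported in the abstract for continuous traffic, applied here as a simple linear estimate.

```python
WORD_MASK = 0xFFFFFFFF  # transfers are modeled in units of 32-bit words


class Fabric:
    """Illustrative stand-in for the coprocessor-plus-router path (hypothetical)."""

    def __init__(self, num_nodes, window_words=8):
        # One memory window per node, exposed for remote access.
        self.windows = [[0] * window_words for _ in range(num_nodes)]

    def _check(self, target, offset, count):
        if offset < 0 or count < 0 or offset + count > len(self.windows[target]):
            raise IndexError("access outside the target window")

    def put(self, target, offset, words):
        """One-sided write: the target node executes no matching receive."""
        self._check(target, offset, len(words))
        masked = [w & WORD_MASK for w in words]
        self.windows[target][offset:offset + len(words)] = masked

    def get(self, target, offset, count):
        """One-sided read: returns a copy of words from the target window."""
        self._check(target, offset, count)
        return list(self.windows[target][offset:offset + count])


def transfer_cycles(num_words, cycles_per_word=1.35):
    """Linear estimate of clock cycles for a stream of 32-bit words."""
    return num_words * cycles_per_word
```

For example, after `fabric = Fabric(num_nodes=2)`, calling `fabric.put(target=1, offset=2, words=[7, 9])` changes only node 1's window, and `fabric.get(target=1, offset=2, count=2)` reads those words back without any action by node 1. A real design would also need synchronization between the initiator and the target, which this sketch omits.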