Network Interface Architecture for Scalable Message Queue Processing

Authors:
Noboru Tanabe; Atsushi Ohta; Pulung Waskito; Hironori Nakajo
Venue:
ICPADS '09 Proceedings of the 2009 15th International Conference on Parallel and Distributed Systems
Year:
2009

Abstract

Most scientists outside computer science are reluctant to rewrite their MPI applications for performance tuning. Meanwhile, the number of processing elements available to them grows year by year. On large-scale parallel systems, the number of messages accumulated in a message buffer tends to grow in some of these applications. Because searching the message queue in MPI is time-consuming, such systems need scalable acceleration on the system side. This paper proposes a support function named LHS (Limited-length Head Separation) and evaluates its message-buffer search performance and hardware cost. LHS accelerates message-buffer searches by switching the location where limited-length heads of messages are stored. It exploits effects such as a higher cache hit rate on the host, with partial off-loading to hardware. With LHS on an FPGA-based network interface card (NIC) named DIMMnet-2, searching the message buffer when messages arrive in an order different from the one the receiver expects is 14.3 times faster. This absolute performance is 38.5 times higher than that of IBM BlueGene/P, even though the operating frequency is 8.5 times lower than that of BlueGene/P. The hardware cost of LHS is significantly lower than that of ALPU, a hardware accelerator for searching message buffers. LHS also scales better than ALPU in performance per frequency. LHS is therefore better suited to larger parallel systems.
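The abstract does not spell out the data layout used by LHS, so the following is only a minimal sketch of the general idea it names: keeping limited-length heads apart from message bodies so that a queue search touches a compact region that is more likely to stay in cache. The class name `SeparatedQueue`, the `HEAD_LEN` limit, the `ANY` wildcard, and the (source, tag, context) envelope are all illustrative assumptions, not the paper's design, and the sketch is plain software with no hardware off-loading.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

ANY = -1       # assumed wildcard, standing in for MPI_ANY_SOURCE / MPI_ANY_TAG
HEAD_LEN = 16  # hypothetical limit on the stored head length, in bytes


@dataclass
class SeparatedQueue:
    """Unexpected-message queue that stores heads and bodies apart (illustrative)."""
    # Compact matching envelopes (source, tag, context) scanned during a search.
    envelopes: List[Tuple[int, int, int]] = field(default_factory=list)
    # Limited-length heads: at most HEAD_LEN leading bytes of each payload.
    heads: List[bytes] = field(default_factory=list)
    # Remaining payload bytes, kept in a separate store that searches never touch.
    bodies: List[bytes] = field(default_factory=list)

    def enqueue(self, source: int, tag: int, context: int, payload: bytes) -> None:
        """Record an arrived message that no receive has matched yet."""
        self.envelopes.append((source, tag, context))
        self.heads.append(payload[:HEAD_LEN])
        self.bodies.append(payload[HEAD_LEN:])

    def match(self, source: int, tag: int, context: int) -> Optional[bytes]:
        """Return the oldest matching message, or None if nothing matches."""
        for i, (s, t, c) in enumerate(self.envelopes):
            if c == context and source in (ANY, s) and tag in (ANY, t):
                del self.envelopes[i]
                # Head and body are rejoined only after a match is found.
                return self.heads.pop(i) + self.bodies.pop(i)
        return None
```

When messages arrive in an order the receiver did not expect, `match` has to walk past the earlier entries; in this layout that walk reads only the small envelope tuples, which is the property the sketch is meant to illustrate.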