An enhancer of memory and network for applications with large-capacity data and non-continuous data accessing

Authors: Noboru Tanabe (Toshiba R&D Center, Kawasaki, Japan); Hirotaka Hakozaki, Hiroshi Ando, Yasunori Dohi (Yokohama National University, Yokohama, Japan); Zhengzhe Luo, Hironori Nakajo (Tokyo University of Agriculture and Technology, Tokyo, Japan)
Venue: The Journal of Supercomputing
Year: 2010

Abstract

The performance of memory and I/O systems has not kept pace with that of COTS (Commercial Off-The-Shelf) CPUs, and PC clusters built from COTS CPUs are employed for HPC. In applications with low spatial locality, a cache-based processor is far less effective than a vector processor. Moreover, insufficient main-memory capacity is a serious problem for HPC, Google-like server farms, and database processing, and the power consumption of a Google-like server farm or a high-end HPC PC cluster is huge. To overcome these problems, we propose the concept of a memory and network enhancer equipped with scatter and gather vector access functions, high-performance network connectivity, and extensible capacity. We also propose two communication mechanisms, LHS and LHC, which reduce latency for mixed messages consisting of small control data and a large data body. Examples of killer applications for this new type of hardware are presented. Beyond concepts and simulations, this paper presents real hardware prototypes named DIMMnet-2 and DIMMnet-3 and evaluates both memory and network issues. For memory issues, we evaluate the module with the NAS CG benchmark (class C) and the Wisconsin benchmarks. Although CG class C is difficult to evaluate with conventional cycle-accurate simulation methods, we obtained results for class C using our own method. We find that the module achieves a maximum speedup of about 25 times on the Wisconsin benchmarks. However, the results on a cache-based PC show that cache-line flushing degrades the speedup. This indicates the high potential of combining the proposed extended memory module with processors that access main memory through DMA, such as the SPU on the Cell/B.E., which do not require cache-line flushing. Finally, we evaluate the LHS and LHC communication mechanisms and show their effects on latency.
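Why gather helps applications with low spatial locality can be illustrated with a minimal sketch. It assumes a 64-byte host cache line and 8-byte elements, and every name in it (gather, scatter, bytes_moved_cache, bytes_moved_gather) is hypothetical, not the DIMMnet-2/DIMMnet-3 interface. It compares the bytes a cache-based CPU pulls in for strided accesses against the bytes that cross the host interface if the memory module packs the requested elements into a contiguous vector first. The resulting ratio is plain arithmetic under these assumptions, not a measurement from the paper.

```python
# Hypothetical sketch of scatter/gather vector access semantics; not the
# actual DIMMnet-2/DIMMnet-3 API. Sizes below are assumptions.

CACHE_LINE_BYTES = 64  # assumed host cache-line size
ELEM_BYTES = 8         # assumed element size (e.g. a double)


def gather(memory, indices):
    """Indexed load: pack memory[i] for each i into a contiguous vector."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Indexed store: write values[k] to memory[indices[k]]."""
    for i, v in zip(indices, values):
        memory[i] = v


def bytes_moved_cache(indices):
    """Bytes a cache-based CPU fetches: one full line per distinct line touched."""
    elems_per_line = CACHE_LINE_BYTES // ELEM_BYTES
    lines = {i // elems_per_line for i in indices}
    return len(lines) * CACHE_LINE_BYTES


def bytes_moved_gather(indices):
    """Bytes crossing the host interface if the memory module gathers first."""
    return len(indices) * ELEM_BYTES


if __name__ == "__main__":
    memory = list(range(1024))
    # Stride of 16 elements: with 8 elements per line, each access hits a new line.
    idx = list(range(0, 1024, 16))
    print(gather(memory, idx)[:4])   # [0, 16, 32, 48]
    print(bytes_moved_cache(idx))    # 64 lines * 64 B = 4096
    print(bytes_moved_gather(idx))   # 64 elements * 8 B = 512
```

Under these assumptions the gathered transfer is eight times smaller, because only the useful words, not whole cache lines, travel to the host.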