Architecture and Performance of the Hitachi SR2201 Massively Parallel Processor System

Authors:
Hiroaki Fujii;Yoshiko Yasuda;Hideya Akashi;Yasuhiro Inagami;Makoto Koga;Osamu Ishihara;Masamori Kashiyama;Hideo Wada;Tsutomu Sumimoto
Venue:
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
Year:
1997

Citing 6

Parallelization of loops with exits on pipelined architectures
Proceedings of the 1990 ACM/IEEE conference on Supercomputing

Register allocation for software pipelined loops
PLDI '92 Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation

A scalar architecture for pseudo vector processing based on slide-windowed registers
ICS '93 Proceedings of the 7th international conference on Supercomputing

The SP2 high-performance switch
IBM Systems Journal

Measurement of Communication Rates on the Cray T3D Interprocessor Network
HPCN Europe 1994 Proceedings of the International Conference and Exhibition on High-Performance Computing and Networking Volume II: Networking and Tools

Deadlock-Free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing

Cited 11

On the Performance of Parallel Matrix Factorisation on the Hypermesh
The Journal of Supercomputing

Hypermeshes: implementation and performance
Journal of Systems Architecture: the EUROMICRO Journal

On the merits of hypermeshes and tori with adaptive routing
Journal of Systems Architecture: the EUROMICRO Journal

Deadlock-Free Fault-tolerant Routing in the Multi-dimensional Crossbar Network and Its Implementation for the Hitachi SR2201
IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing

A FIFO Queue Class Library as a State Variable of Time Warp Logical Processes
ISCOPE '98 Proceedings of the Second International Symposium on Computing in Object-Oriented Parallel Environments

On Line Visualization or Combining the Standard ORNL PVM with a Vendor PVM Implementation
Proceedings of the 6th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface

A Scalable and High Capacity Router on Multi-Dimension Crossbar Switch Principle
LCN '01 Proceedings of the 26th Annual IEEE Conference on Local Computer Networks

RDMA control support for fine-grain parallel computations
Journal of Systems Architecture: the EUROMICRO Journal - Special issue: Parallel, distributed and network-based processing

Hamming hypermeshes: high performance interconnection networks for pin-out limited systems
Performance Evaluation

Block size selection of parallel LU and QR on PVP-based and RISC-based supercomputers
CHINA HPC '07 Proceedings of the 2007 Asian technology information program's (ATIP's) 3rd workshop on High performance computing in China: solution approaches to impediments for high performance computing

A vector-parallel FFT with a user-specifiable data distribution scheme
ISPA'03 Proceedings of the 2003 international conference on Parallel and distributed processing and applications

Abstract

RISC-based massively parallel processors (MPPs) often show low efficiency in real-world applications because of cache miss penalties, insufficient memory-system throughput, and poor inter-processor communication performance. Hitachi's SR2201, an MPP scalable up to 2048 processors and 600 GFLOPS peak performance, overcomes these problems by introducing three novel features. First, its processor, the 150 MHz HARP-1E, addresses the cache miss penalty with "pseudo vector processing" (PVP), in which data is prefetched into a special register bank, bypassing the cache. Second, a multi-bank memory architecture that operates like a pipeline eliminates the memory system bottleneck. Third, inter-processor communication achieves high performance on the three-dimensional crossbar network by using a "remote DMA transfer" protocol and hardware-based cache coherency. As a result of these improvements, the SR2201 achieved 220.4 GFLOPS with 1024 processors in the LINPACK benchmark, which is almost 72% of the peak performance.
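
The efficiency figure quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only: the abstract gives the clock rate (150 MHz), the processor counts, and the measured LINPACK result, but it does not state the per-processor peak. The value of 2 floating-point operations per cycle is an assumption made here, chosen because it gives a 2048-processor peak close to the quoted 600 GFLOPS. The function names are hypothetical.

```python
# Back-of-the-envelope check of the abstract's peak and efficiency figures.
# ASSUMPTION (not stated in the abstract): each processor can complete
# 2 floating-point operations per clock cycle.

CLOCK_HZ = 150e6          # 150 MHz HARP-1E clock, from the abstract
FLOPS_PER_CYCLE = 2       # assumed, see note above


def peak_gflops(processors: int) -> float:
    """Theoretical peak in GFLOPS for a given processor count."""
    return processors * CLOCK_HZ * FLOPS_PER_CYCLE / 1e9


def efficiency(measured_gflops: float, processors: int) -> float:
    """Fraction of theoretical peak reached by a measured result."""
    return measured_gflops / peak_gflops(processors)


if __name__ == "__main__":
    full_peak = peak_gflops(2048)            # largest configuration
    linpack_peak = peak_gflops(1024)         # configuration used for LINPACK
    linpack_eff = efficiency(220.4, 1024)    # measured result from the abstract

    print(f"Peak, 2048 processors: {full_peak:.1f} GFLOPS")
    print(f"Peak, 1024 processors: {linpack_peak:.1f} GFLOPS")
    print(f"LINPACK efficiency:    {linpack_eff:.1%}")
```

Under that assumption the 1024-processor peak is 307.2 GFLOPS, and 220.4 / 307.2 is about 71.7%, which is consistent with the "almost 72%" stated in the abstract.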