IPPS '97 Proceedings of the 11th International Symposium on Parallel Processing
RISC-based massively parallel processors (MPPs) often show low efficiency in real-world applications because of cache miss penalties, insufficient memory system throughput, and poor inter-processor communication performance. Hitachi's SR2201, an MPP scalable up to 2048 processors and 600 GFLOPS peak performance, overcomes these problems with three novel features. First, its processor, the 150 MHz HARP-1E, avoids the cache miss penalty through "pseudo vector processing" (PVP), in which data is prefetched into a special register bank, bypassing the cache. Second, a multi-bank memory architecture that operates like a pipeline eliminates the memory system bottleneck. Third, inter-processor communication achieves high performance over a three-dimensional crossbar network by using a "remote DMA transfer" protocol and hardware-based cache coherency. As a result of these improvements, the SR2201 achieved 220.4 GFLOPS with 1024 processors on the LINPACK benchmark, almost 72% of peak performance.
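The efficiency figure quoted above can be checked with a short calculation. The sketch below is illustrative only: the clock rate, processor count, and LINPACK result come from the abstract, while the value of two floating-point operations per cycle per processor is an assumption not stated there, and the function names are hypothetical.

```python
# Figures quoted in the abstract.
CLOCK_HZ = 150e6          # HARP-1E clock rate
LINPACK_GFLOPS = 220.4    # measured LINPACK result
PROCESSORS = 1024         # processors used in the LINPACK run

# ASSUMPTION (not stated in the abstract): each processor completes
# two floating-point operations per cycle at peak.
FLOPS_PER_CYCLE = 2


def peak_gflops(processors, clock_hz=CLOCK_HZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in GFLOPS for a given processor count."""
    return processors * clock_hz * flops_per_cycle / 1e9


def efficiency(measured_gflops, processors):
    """Fraction of theoretical peak reached by a measured result."""
    return measured_gflops / peak_gflops(processors)


if __name__ == "__main__":
    print(f"peak: {peak_gflops(PROCESSORS):.1f} GFLOPS")
    print(f"efficiency: {efficiency(LINPACK_GFLOPS, PROCESSORS):.1%}")
```

Under that assumption the peak for 1024 processors is 307.2 GFLOPS and the efficiency is about 71.7%, which agrees with "almost 72%". The same assumption gives 614.4 GFLOPS for 2048 processors, which the abstract appears to round to 600 GFLOPS.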