Resource Scaling Effects on MPP Performance: The STAP Benchmark Implications

Authors:
Kai Hwang; Cho-Li Wang; Choming Wang; Zhiwei Xu
Affiliations:
Univ. of Southern California, Los Angeles; Univ. of Hong Kong, Hong Kong; Univ. of Southern California, Los Angeles; National Center for Intelligent Computing Systems, Beijing, China
Venue:
IEEE Transactions on Parallel and Distributed Systems
Year:
1999




Abstract

Massively parallel processors (MPPs) are currently available in only a few commercial models. A sequence of three ASCI Teraflops MPPs has appeared before the new millennium. This paper evaluates six MPP systems through STAP benchmark experiments. STAP is a radar signal processing benchmark that exploits regularly structured SPMD data parallelism. We reveal the resource scaling effects on MPP performance along the orthogonal dimensions of machine size, processor speed, memory capacity, messaging latency, and network bandwidth. We show how to achieve balanced resource scaling against an enlarged workload (problem size). Among the three commercial MPPs, the IBM SP2 shows the highest speed and efficiency, attributed to its well-designed network with middleware support for a single system image. The Cray T3D demonstrates high network bandwidth with a good NUMA memory hierarchy. The Intel Paragon trails far behind because of its slow processors and the excessive latency of message passing. Our analysis projects the lowest STAP speed on the ASCI Red, compared with the projected speeds of the two ASCI Blue machines. This is attributed to the slow processors used in ASCI Red and the mismatch between its hardware and software. The Blue Pacific shows the highest potential to deliver scalable performance up to thousands of nodes. The Blue Mountain is designed to have the highest network bandwidth. Our results suggest a limit on the scalability of the distributed shared-memory (DSM) architecture adopted in Blue Mountain. The scaling model offers a quantitative method to match resource scaling with problem scaling to yield truly scalable performance. The model helps MPP designers optimize the processors, memory, network, and I/O subsystems of an MPP. For MPP users, the scaling results can be applied to partition a large workload for SPMD execution or to minimize the software overhead in collective communication or remote memory update operations. Finally, we assess the use of our scaling model to evaluate MPPs with benchmarks other than STAP.
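
The abstract names five scaling dimensions but gives no formula. As an illustration only, the following minimal sketch shows how four of them (machine size, processor speed, messaging latency, and network bandwidth) could enter a simple execution-time estimate for an SPMD workload. It is a generic compute-plus-communication cost model, not the scaling model of the paper. The names `Machine`, `predicted_time`, `speedup`, and `efficiency`, and all parameters, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical machine description; not taken from the paper."""
    nodes: int             # machine size (number of processors)
    flops_per_node: float  # sustained processor speed, flop/s
    latency_s: float       # per-message latency, seconds
    bandwidth_bps: float   # network bandwidth per node, bytes/s


def predicted_time(work_flop, msgs_per_node, bytes_per_node, m):
    """Estimate run time as perfectly divided compute plus communication.

    Assumes the work splits evenly across nodes and that computation and
    communication do not overlap (both are simplifying assumptions).
    """
    compute = work_flop / (m.nodes * m.flops_per_node)
    comm = msgs_per_node * m.latency_s + bytes_per_node / m.bandwidth_bps
    return compute + comm


def speedup(work_flop, msgs_per_node, bytes_per_node, m):
    """Ratio of one-node time (no communication) to the n-node estimate."""
    t1 = work_flop / m.flops_per_node
    return t1 / predicted_time(work_flop, msgs_per_node, bytes_per_node, m)


def efficiency(work_flop, msgs_per_node, bytes_per_node, m):
    """Speedup divided by machine size."""
    return speedup(work_flop, msgs_per_node, bytes_per_node, m) / m.nodes
```

Under these assumptions, raising processor speed alone shrinks only the compute term, so efficiency falls unless latency and bandwidth scale with it. That is one way to read the abstract's point about balanced resource scaling.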