Performance and Scalability Analysis of Teraflop-Scale Parallel Architectures Using Multidimensional Wavefront Applications

Authors:
Adolfy Hoisie; Olaf Lubeck; Harvey Wasserman
Affiliations:
Scientific Computing Group, Los Alamos National Laboratory, Los Alamos, New Mexico (all authors)
Venue:
International Journal of High Performance Computing Applications
Year:
2000



Abstract

The authors develop a model for the parallel performance of algorithms that consist of concurrent, two-dimensional wavefronts implemented in a message-passing environment. The model, based on a LogGP machine parameterization, combines the separate contributions of computation and communication wavefronts. The authors validate the model on three important supercomputer systems, on up to 500 processors. They use data from a deterministic particle transport application taken from the ASCI workload, although the model is general to any wavefront algorithm implemented on a 2-D processor domain. They also use the validated model to make estimates of performance and scalability of wavefront algorithms on 100 TFLOPS computer systems expected to be in existence within the next decade as part of the ASCI program and elsewhere. In this context, the authors analyze two problem sizes. Their model shows that on the largest such problem (1 billion cells), interprocessor communication performance is not the bottleneck. Single-node efficiency is the dominant factor.
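To make the structure of such a model concrete, the sketch below estimates the time of one pipelined wavefront sweep over a 2-D processor grid as the sum of a computation term and a communication term, with message cost following the standard LogGP form o + (k-1)G + L + o. This is a simplified illustration, not the authors' published equations: the names `Machine`, `message_time`, and `wavefront_time`, the pipeline-fill count `px + py - 2`, and the choice of two messages per step are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical LogGP-style machine parameters (all times in seconds)."""
    latency: float       # L: network latency per message
    overhead: float      # o: CPU overhead to send or receive one message
    gap_per_byte: float  # G: time per byte for long messages


def message_time(m: Machine, nbytes: int) -> float:
    """LogGP cost of one point-to-point message: o + (k-1)G + L + o."""
    return m.overhead + (nbytes - 1) * m.gap_per_byte + m.latency + m.overhead


def wavefront_time(m: Machine, px: int, py: int, n_blocks: int,
                   t_block: float, nbytes: int) -> float:
    """Estimate one sweep on a px-by-py processor grid.

    Assumptions (not taken from the paper):
      * the last processor starts after px + py - 2 pipeline-fill steps,
      * every processor then works through n_blocks blocks of t_block each,
      * every step sends two boundary messages (one per grid direction).
    """
    steps = (px + py - 2) + n_blocks
    t_comp = steps * t_block
    t_comm = 2 * steps * message_time(m, nbytes)
    return t_comp + t_comm


if __name__ == "__main__":
    # Illustrative parameter values only; they describe no real system.
    machine = Machine(latency=10e-6, overhead=2e-6, gap_per_byte=5e-9)
    total = wavefront_time(machine, px=16, py=16, n_blocks=100,
                           t_block=1e-3, nbytes=4096)
    print(f"estimated sweep time: {total:.6f} s")
```

In a model of this shape, increasing the per-block compute time `t_block` relative to `message_time` shifts the total toward the computation term, which is the kind of comparison the abstract's conclusion about single-node efficiency rests on.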