The authors develop a model for the parallel performance of algorithms that consist of concurrent, two-dimensional wavefronts implemented in a message-passing environment. The model, based on a LogGP machine parameterization, combines the separate contributions of computation and communication wavefronts. The authors validate the model on three supercomputer systems, using up to 500 processors. The validation data come from a deterministic particle transport application taken from the ASCI workload, although the model applies to any wavefront algorithm implemented on a 2-D processor domain. The authors also use the validated model to estimate the performance and scalability of wavefront algorithms on 100-TFLOPS computer systems expected to exist within the next decade as part of the ASCI program and elsewhere. In this context, they analyze two problem sizes. The model shows that on the larger of the two (1 billion cells), interprocessor communication performance is not the bottleneck; single-node efficiency is the dominant factor.
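
To make the idea of combining computation and communication wavefronts concrete, the following is a minimal sketch of a pipelined wavefront cost estimate on a 2-D processor grid. It is not the authors' model, whose exact form the abstract does not give. The sketch assumes that every pipeline stage costs one block of computation plus two downstream messages, and that a sweep takes a pipeline-fill phase of `px + py - 2` stages followed by one stage per block. The per-message cost uses the standard LogGP expression for a single message, `o + (k - 1) * G + L + o`. All names (`LogGP`, `message_time`, `wavefront_sweep_time`) and all parameter values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGP:
    """Hypothetical container for the LogGP parameters used in this sketch."""
    L: float  # network latency, in seconds
    o: float  # send or receive overhead per message, in seconds
    G: float  # gap per byte for long messages, in seconds per byte


def message_time(m: LogGP, nbytes: int) -> float:
    """End-to-end time of one nbytes-long message under LogGP."""
    return m.o + (nbytes - 1) * m.G + m.L + m.o


def wavefront_sweep_time(px: int, py: int, blocks: int,
                         t_block: float, msg_bytes: int, m: LogGP) -> float:
    """Simplified estimate of one sweep on a px-by-py processor grid.

    Assumption (not taken from the source): each stage is one block of
    computation followed by two boundary messages, one per downstream
    neighbor, with no overlap of computation and communication.
    """
    fill_stages = px + py - 2  # stages before the far-corner processor starts
    stage_time = t_block + 2 * message_time(m, msg_bytes)
    return (fill_stages + blocks) * stage_time


if __name__ == "__main__":
    # Illustrative parameter values only; they describe no real machine.
    machine = LogGP(L=5e-6, o=2e-6, G=1e-9)
    estimate = wavefront_sweep_time(px=16, py=16, blocks=100,
                                    t_block=1e-3, msg_bytes=8192,
                                    m=machine)
    print(f"estimated sweep time: {estimate:.6f} s")
```

In a sketch like this, the fill term grows with the processor grid while the per-stage compute term shrinks only if single-node efficiency improves, which is one way to see why a large problem can end up dominated by node performance rather than by the network.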