This work implements a computationally expensive chemical kinetics kernel from a large-scale community atmospheric model on three multi-core platforms: NVIDIA GPUs using CUDA, the Cell Broadband Engine, and Intel Quad-Core Xeon CPUs. A comparative performance analysis of each platform, in double and single precision and on coarse and fine grids, is presented. Platform-specific design and optimization are discussed in a mechanism-agnostic way, so that the same optimizations apply to many chemical mechanisms. The implementation of a three-stage Rosenbrock solver for SIMD architectures is also discussed. When used as a template mechanism in the Kinetic PreProcessor, the multi-core implementation enables the automatic optimization and porting of many chemical mechanisms to a variety of multi-core platforms. Speedups of 5.5x in single precision and 2.7x in double precision are observed relative to eight Xeon cores. Relative to the serial implementation, the maximum observed speedup is 41.1x in single precision.