Gyrokinetic particle simulation model
Journal of Computational Physics
14.9 TFLOPS three-dimensional fluid simulation for fusion science with HPF on the Earth Simulator
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Proceedings of the 2002 ACM/IEEE conference on Supercomputing
Performance enhancement strategies for multi-block overset grid CFD applications
Parallel Computing - Special issue: Parallel and distributed scientific and engineering computing
Evaluating support for global address space languages on the Cray X1
Proceedings of the 18th annual international conference on Supercomputing
Floating-point sparse matrix-vector multiply for FPGAs
Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
Scientific Computations on Modern Parallel Vector Systems
Proceedings of the 2004 ACM/IEEE conference on Supercomputing
Fast Parallel Non-Contiguous File Access
Proceedings of the 2003 ACM/IEEE conference on Supercomputing
Implicit and explicit optimizations for stencil computations
Proceedings of the 2006 workshop on Memory system performance and correctness
An on-chip cache design for vector processors
MEDEA '07 Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture
IWOMP '07 Proceedings of the 3rd international workshop on OpenMP: A Practical Programming Model for the Multi-Core Era
A shared cache for a chip multi vector processor
Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture
Performance evaluation of NEC SX-9 using real science and engineering applications
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
VECPAR'06 Proceedings of the 7th international conference on High performance computing for computational science
Performance evaluation of scientific applications on modern parallel vector systems
VECPAR'06 Proceedings of the 7th international conference on High performance computing for computational science
Extracting ultra-scale Lattice Boltzmann performance via hierarchical and distributed auto-tuning
Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
A performance evaluation of the cray x1 for scientific applications
VECPAR'04 Proceedings of the 6th international conference on High Performance Computing for Computational Science
Performance characteristics of a cosmology package on leading HPC architectures
HiPC'04 Proceedings of the 11th international conference on High Performance Computing
Implications of memory performance for highly efficient supercomputing of scientific applications
ISPA'06 Proceedings of the 4th international conference on Parallel and Distributed Processing and Applications
Optimization of geometric multigrid for emerging multi- and manycore processors
SC '12 Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
Hi-index | 0.00 |
The growing gap between sustained and peak performance for scientific applications is a well-known problem in high end computing. The recent development of parallel vector systems offers the potential to bridge this gap for many computational science codes and deliver a substantial increase in comput-ing capabilities. This paper examines the intranode performance of the NEC SX-6 vector processor and the cache-based IBM Power3/4 superscalar architectures across a number of scientific computing areas. First, we present the performance of a microbenchmark suite that examines low-level machine characteristics. Next, we study the behavior of the NAS Parallel Benchmarks. Finally, we evaluate the performance of several scientific computing codes. Results demonstrate that the SX-6 achieves high performance on a large fraction of our applications and often significantly outperforms the cache-based architectures. However, certain applications are not easily amenable to vectorization and would require extensive algorithm and implementation reengineering to utilize the SX-6 effectively.