Stencil-based computation on structured grids is a common kernel in a broad range of scientific applications. The order of the stencil increases with the required precision, and optimizing such high-order stencils on multicore architectures is a challenge. Here, we propose a multilevel parallelization framework that combines: (1) inter-node parallelism by spatial decomposition; (2) intra-chip parallelism through multithreading; and (3) data-level parallelism via single-instruction multiple-data (SIMD) techniques. The framework is applied to a sixth-order stencil-based seismic wave propagation code on a suite of multicore architectures. Strong-scaling tests exhibit superlinear speedup due to increasing cache capacity on Intel Harpertown and AMD Barcelona based clusters, whereas the weak-scaling parallel efficiency is 0.92 on 65,536 BlueGene/P processors. Multithreading+SIMD optimizations achieve a 7.85-fold speedup on a dual quad-core Intel Clovertown, and the data-level parallel efficiency is found to depend on the stencil order.
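As a rough illustration of level (1), the sketch below applies a sixth-order central-difference stencil for the second derivative on a 1D grid, first over the whole grid and then over subdomains that each carry a halo of three points per side. It is not the authors' code: the abstract describes a 3D seismic solver, and the function names `laplacian_1d` and `decomposed_laplacian_1d`, the 1D setting, and the sequential loop standing in for separate nodes are all assumptions made for illustration. The coefficients are the standard sixth-order finite-difference weights.

```python
# Standard sixth-order central-difference weights for d2/dx2:
# centre weight first, then the weights at offsets 1, 2 and 3.
COEFFS = (-49.0 / 18.0, 3.0 / 2.0, -3.0 / 20.0, 1.0 / 90.0)
HALO = len(COEFFS) - 1  # a sixth-order stencil reaches 3 points to each side


def laplacian_1d(u, h):
    """Apply the stencil to every point of u that has HALO neighbours on both sides.

    Returns a list of len(u) - 2 * HALO values (hypothetical helper).
    """
    out = []
    for i in range(HALO, len(u) - HALO):
        acc = COEFFS[0] * u[i]
        for k in range(1, HALO + 1):
            acc += COEFFS[k] * (u[i - k] + u[i + k])
        out.append(acc / (h * h))
    return out


def decomposed_laplacian_1d(u, h, parts):
    """Split the interior points among `parts` subdomains (hypothetical helper).

    Each subdomain receives its owned points plus HALO ghost points per side,
    mimicking the data a node would hold after a halo exchange. The loop runs
    sequentially here; on a cluster each iteration would be a separate process.
    """
    n_interior = len(u) - 2 * HALO
    out = []
    for p in range(parts):
        lo = p * n_interior // parts
        hi = (p + 1) * n_interior // parts
        local = u[lo:hi + 2 * HALO]  # owned points plus both halos
        out.extend(laplacian_1d(local, h))
    return out


if __name__ == "__main__":
    h = 0.5
    u = [(i * h) ** 2 for i in range(20)]  # u(x) = x^2, so u''(x) = 2
    whole = laplacian_1d(u, h)
    split = decomposed_laplacian_1d(u, h, parts=4)
    print(whole == split)
    print(max(abs(v - 2.0) for v in whole) < 1e-9)
```

The halo width equals half the stencil order, so under this kind of decomposition a higher-order stencil requires more boundary data per subdomain.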