Understanding the most efficient design and utilization of emerging multicore systems is one of the most challenging questions the mainstream and scientific computing industries have faced in several decades. Our work explores multicore stencil (nearest-neighbor) computations, a class of algorithms at the heart of many structured grid codes, including PDE solvers. We develop a number of effective optimization strategies and build an auto-tuning environment that searches over our optimizations and their parameters to minimize runtime while maximizing performance portability. To evaluate the effectiveness of these strategies, we explore the broadest set of multicore architectures in the current HPC literature, including the Intel Clovertown, AMD Barcelona, Sun Victoria Falls, IBM QS22 PowerXCell 8i, and NVIDIA GTX280. Overall, our auto-tuning optimization methodology results in the fastest multicore stencil performance to date. Finally, we present several key insights into the architectural tradeoffs of emerging multicore designs and their implications for scientific algorithm development.
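To make the two central ideas concrete, the following is a minimal sketch of a 7-point nearest-neighbor stencil sweep with an optional cache-blocking parameter, plus a toy auto-tuner that times each candidate block size and keeps the fastest. It is not the paper's implementation: the function names (`make_grid`, `stencil_sweep`, `autotune`), the coefficients `alpha` and `beta`, the flat-list grid layout, and the candidate block sizes are all assumptions chosen for illustration, and pure Python timings say nothing about the architectures studied.

```python
import time


def make_grid(n):
    # Deterministic n*n*n grid stored as a flat list; index = (z*n + y)*n + x.
    return [float((i * 7) % 13) for i in range(n * n * n)]


def stencil_sweep(src, n, block=None, alpha=0.4, beta=0.1):
    """One out-of-place 7-point sweep over the interior; boundary cells are copied.

    `block` is an assumed (bx, by) tile size in the x and y dimensions.
    With block=None the sweep is unblocked.
    """
    bx, by = block if block else (n, n)
    dst = list(src)
    plane = n * n
    for yy in range(1, n - 1, by):            # loop over y tiles
        for xx in range(1, n - 1, bx):        # loop over x tiles
            for z in range(1, n - 1):         # stream through z inside a tile
                for y in range(yy, min(yy + by, n - 1)):
                    row = (z * n + y) * n
                    for x in range(xx, min(xx + bx, n - 1)):
                        i = row + x
                        dst[i] = alpha * src[i] + beta * (
                            src[i - 1] + src[i + 1]
                            + src[i - n] + src[i + n]
                            + src[i - plane] + src[i + plane]
                        )
    return dst


def autotune(src, n, candidates, repeats=3):
    """Exhaustively time each candidate block size; return (best_block, best_seconds)."""
    best_block, best_time = None, float("inf")
    for block in candidates:
        elapsed = float("inf")
        for _ in range(repeats):              # keep the fastest of several runs
            start = time.perf_counter()
            stencil_sweep(src, n, block)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block, best_time


if __name__ == "__main__":
    n = 16
    grid = make_grid(n)
    block, seconds = autotune(grid, n, [(4, 4), (8, 8), (16, 16)])
    print("fastest block:", block, "seconds:", seconds)
```

Because the sweep reads only from `src` and writes only to `dst`, every block size produces the same result as the unblocked sweep, so the tuner is free to choose purely on measured time. A realistic tuner would search a much larger space (for example threading, padding, and prefetching choices) rather than block size alone.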